Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1982996

Summary: Ignore plug_vifs on the ironic driver
Product: Red Hat OpenStack
Reporter: Jaison Raju <jraju>
Component: openstack-nova
Assignee: melanie witt <mwitt>
Status: CLOSED ERRATA
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium
Docs Contact:
Priority: medium
Version: 16.1 (Train)
CC: alifshit, bdobreli, dasmith, eglynn, jhakimra, jkreger, jraju, kchamart, mwitt, sbauza, sgordon, vromanso
Target Milestone: z9
Keywords: Patch, Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openstack-nova-20.4.1-1.20220524123356.1ee93b9.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2085272 (view as bug list)
Environment:
Last Closed: 2022-12-07 20:24:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2085275    
Bug Blocks:    

Description Jaison Raju 2021-07-16 08:39:22 UTC
Description of problem:
In an 800-node overcloud scale environment, restarting nova-compute causes the nova-compute service to remain down for over an hour.
From observation, nova.compute.manager performs _get_power_state and VIF plugging for each nova instance (overcloud node) sequentially, which takes a very long time. During this period, nova-compute does not send a heartbeat.
_plug_vifs itself takes ~4 seconds per instance and always fails with an HTTP 409 conflict, so I am wondering whether it is necessary at all in an OpenStack director environment:
... Error contacting Ironic server: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409) ...
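
To illustrate the arithmetic, the startup path behaves roughly like the sketch below (a simplification only; the helper names and per-call timings are illustrative, not nova code):

# Rough sketch of the sequential startup loop described above; not nova code.
# Each instance is handled one after another, and no heartbeat is sent until
# the whole loop completes.

NUM_INSTANCES = 800          # overcloud nodes managed by this nova-compute
POWER_STATE_SECONDS = 0.5    # illustrative cost of _get_power_state per instance
PLUG_VIFS_SECONDS = 4.0      # observed cost of _plug_vifs, which ends in HTTP 409


def init_host(num_instances: int) -> float:
    """Return the total time (seconds) the sequential init loop would take."""
    total = 0.0
    for _ in range(num_instances):
        total += POWER_STATE_SECONDS   # _get_power_state round trip
        total += PLUG_VIFS_SECONDS     # _plug_vifs -> Ironic vif_attach -> 409
    return total


if __name__ == "__main__":
    print(f"~{init_host(NUM_INSTANCES) / 60:.0f} minutes before the service reports as up")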


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210604.n.0 / 16.1_20210602.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy an overcloud at large scale (here, ~800 overcloud nodes managed by a single ironic-backed nova-compute).
2. Restart the nova-compute (nova_compute) service on the undercloud.
3. Observe the nova-compute service status while it initializes the existing instances.

Actual results:
After a nova_compute restart, the service is reported as down for more than 60 minutes.

Expected results:
The nova_compute service should be back up within a reasonable delay after a restart, e.g. around one minute.

Additional info:



2021-07-16 08:20:06.475 7 INFO nova.service [-] Starting compute node (version 20.4.1-1.20210409123502.1ee93b9.el8ost)
2021-07-16 08:20:06.520 7 DEBUG nova.servicegroup.drivers.db [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Seems service nova-compute on host e16-h12-b01-fc640.localdomain is down. Last heartbeat was 2021-07-16 06:38:39. Elapsed time is 6087.520855 is_up /usr/lib/python3.6/site-packages/nova/servicegroup/drivers/db.py:80
2021-07-16 08:20:06.521 7 DEBUG nova.virt.ironic.driver [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Hash ring members are {'e16-h12-b01-fc640.localdomain'} _refresh_hash_ring /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:795
2021-07-16 08:20:08.568 7 DEBUG nova.compute.manager [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] [instance: 09c1a9c6-3e31-422c-9997-25d295213318] Checking state _get_power_state /usr/lib/python3.6/site-packages/nova/compute/manager.py:1505
2021-07-16 08:20:08.979 7 DEBUG nova.virt.ironic.driver [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] plug: instance_uuid=09c1a9c6-3e31-422c-9997-25d295213318 vif=[{"id": "a4103080-688a-47d9-8be7-0dd13920cb24", "address": "0c:c4:7a:19:6c:45", "network": {"id": "9daae6d4-7867-409d-8d9d-86e5b37631b4", "bridge": null, "label": "ctlplane", "subnets": [{"cidr": "192.168.0.0/16", "dns": [], "gateway": {"address": "192.168.0.1", "type": "gateway", "version": 4, "meta": {}}, "ips": [{"address": "192.168.1.108", "type": "fixed", "version": 4, "meta": {}, "floating_ips": []}], "routes": [], "version": 4, "meta": {"dhcp_server": "192.168.0.5"}}], "meta": {"injected": false, "tenant_id": "fb14d195d53740629d98c24d8e45944f", "mtu": 9000, "physical_network": "ctlplane", "tunneled": false}}, "type": "other", "details": {"connectivity": "legacy"}, "devname": "tapa4103080-68", "ovs_interfaceid": null, "qbh_params": null, "qbg_params": null, "active": true, "vnic_type": "baremetal", "profile": {}, "preserve_on_delete": true, "meta": {}}] _plug_vifs /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:1650
2021-07-16 08:20:09.284 7 DEBUG ironicclient.common.http [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Negotiated API version is 1.46 negotiate_version /usr/lib/python3.6/site-packages/ironicclient/common/http.py:244
2021-07-16 08:20:11.094 7 DEBUG ironicclient.common.http [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Error contacting Ironic server: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409). Attempt 1 of 2 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:300
2021-07-16 08:20:12.937 7 ERROR ironicclient.common.http [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Error contacting Ironic server: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409). Attempt 2 of 2: ironicclient.common.apiclient.exceptions.Conflict: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409)
2021-07-16 08:20:12.938 7 DEBUG nova.compute.manager [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] [instance: 09c1a9c6-3e31-422c-9997-25d295213318] Checking state _get_power_state /usr/lib/python3.6/site-packages/nova/compute/manager.py:1505
2021-07-16 08:20:12.972 7 DEBUG nova.compute.manager [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] [instance: 09c1a9c6-3e31-422c-9997-25d295213318] Current state is 1, state in DB is 1. _init_instance /usr/lib/python3.6/site-packages/nova/compute/manager.py:1143

Comment 3 David Vallee Delisle 2021-07-19 18:16:16 UTC
From my understanding, init_host calls _init_instance on each instance. It doesn't look like this is multi-threaded, unless I'm missing something.

The conflict is somewhat expected because _init_instance [1] tries to attach all VIFs; there is no pre-validation, we just trap the conflict exception [2].
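
Roughly, the pattern in [2] looks like the sketch below (a simplification, not the real driver code; Conflict and vif_attach here are stand-ins for ironicclient.common.apiclient.exceptions.Conflict and the real Ironic API call):

# Simplified sketch of the "attach, then trap the conflict" pattern from [2].

class Conflict(Exception):
    """Stand-in for the HTTP 409 raised when the VIF is already attached."""


def vif_attach(node_uuid: str, vif_id: str, attached: set) -> None:
    """Stand-in for the Ironic API call: 409 if the VIF is already attached."""
    if vif_id in attached:
        raise Conflict(f"VIF {vif_id} is already attached (HTTP 409)")
    attached.add(vif_id)


def plug_vif(node_uuid: str, vif_id: str, attached: set) -> None:
    # No pre-validation: always attempt the attach, then swallow the conflict.
    try:
        vif_attach(node_uuid, vif_id, attached)
    except Conflict:
        pass  # already attached from a previous nova-compute run -- ignore


if __name__ == "__main__":
    attached = {"a4103080-688a-47d9-8be7-0dd13920cb24"}
    plug_vif("node-1", "a4103080-688a-47d9-8be7-0dd13920cb24", attached)
    print("conflict trapped; startup continues with the next instance")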

I don't see anything in place to prevent, throttle, cache or multithread these redundant calls, but I suspect the Ironic calls are what is slowing this down.

We will discuss this on Wednesday's triage call, but I'm adding Mel to the needinfo in preparation, since she helped on the previous issue.


In the meantime, Jaison, if you could comment out the lines in [2] in your environment and see whether nova-compute starts much faster, that would help us narrow down the source of the slowness.

Thanks,

DVD

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1136-L1150
[2] https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L1520-L1522

Comment 4 Jaison Raju 2021-07-22 05:46:23 UTC
Commenting out those lines ends in service startup failures.

However, if I skip _plug_vifs entirely, the time taken by _get_power_state and _plug_vifs drops from ~45 minutes to ~2 minutes.

()[root@e16-h12-b01-fc640 /]# diff /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py-bak  /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py
1683c1683
<         self._plug_vifs(node, instance, network_info)
---
> #        self._plug_vifs(node, instance, network_info)

Can we include a configuration option in nova that disables this for ironic nodes only during service startup, so that during instance build on ironic nodes the option does not change the behavior of _plug_vifs/attach_interface?
I am not sure whether this would impact any functionality in the ironic case.
If we could get this, we could have tripleo-common enable the option in nova for director-based deployments.
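
Something along these lines is what I have in mind (a purely hypothetical sketch; skip_plug_vifs_on_init is not an existing nova option, just an illustration of the idea):

# Hypothetical illustration only; "skip_plug_vifs_on_init" is NOT a real
# nova configuration option.
from oslo_config import cfg

opts = [
    cfg.BoolOpt(
        "skip_plug_vifs_on_init",
        default=False,
        help="If True, skip re-plugging VIFs for existing ironic instances "
             "when nova-compute starts. Spawn/attach_interface behavior is "
             "unchanged.",
    ),
]

CONF = cfg.CONF
CONF.register_opts(opts, group="ironic")


def init_instance_vifs(node, instance, network_info, plug_vifs_fn):
    """Call the driver's VIF plug step unless the operator opted out."""
    if CONF.ironic.skip_plug_vifs_on_init:
        return  # startup-only path: the VIFs are already attached in Ironic
    plug_vifs_fn(node, instance, network_info)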

Comment 5 Julia Kreger 2021-12-06 15:03:19 UTC
Entirely unrelated, yet oddly related: I was sick and tired of not being able to schedule breakfast on the weekends with one of my best friends, who operates a multi-thousand-node baremetal cluster fronted by Nova, because they have to restart their nova-compute services almost every weekend. So I put up a patch upstream to disable the plug_vifs action when it originates from startup. I'm unsure whether we could get it backported; that is likely best answered by the compute team. Adding a link for the upstream change set.
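
For reference, the general shape of the idea is roughly the following (an illustrative sketch only, not the actual change set; the real patch is linked in the next comment):

# Illustrative sketch only -- not the actual upstream change. The idea is that
# Ironic already records which VIFs are attached to which ports, so re-plugging
# every VIF from nova-compute's startup path is redundant work that only
# produces HTTP 409 Conflicts.
import logging

LOG = logging.getLogger(__name__)


class SketchIronicDriver:
    def plug_vifs(self, instance, network_info):
        """Startup-time hook: do nothing for ironic instances.

        VIFs are attached during spawn()/attach_interface(), so there is
        nothing to re-plug when the compute service restarts.
        """
        LOG.debug("Ignoring plug_vifs on startup for instance %s",
                  getattr(instance, "uuid", instance))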

Comment 6 melanie witt 2021-12-10 01:14:51 UTC
Apologies, I somehow completely missed this needinfo for me :(

(In reply to Julia Kreger from comment #5)
> Entirely unrelated but also related oddly, I was sick and tired of not being
> able to schedule breakfast on the weekends with one of my best friends who
> operates a multi-thousand node baremetal cluster fronted by Nova, because
> they have to restart their nova-compute services almost every weekend. So I
> put up a patch upstream to disable the plug_vifs action as it originates
> from startup. I'm unsure if we could get it backported, and that is likely
> best answered by the compute team. Adding link for the upstream change set.

I just had a look at the linked patch [1] and it looks backportable to me.

If that patch will solve the issue in this rhbz, we can remove FutureFeature from this and address it as a bug fix backport.

[1] https://review.opendev.org/c/openstack/nova/+/813263

Comment 7 Julia Kreger 2021-12-10 16:02:25 UTC
(In reply to melanie witt from comment #6)

> 
> If that patch will solve the issue in this rhbz, we can remove FutureFeature
> from this and address it as a bug fix backport.
> 

I do think so, given that the timings in the notes indicated that the vast majority of the initialization time was spent on VIF plugging, and trying to do parallel threaded cache creation just seems like a nightmare risk-wise.

I also know of a couple of operators who have expressed intent to, or already have, pulled that patch down into their internal forks/backport branches, so I *suspect* it does the needful and will make folks happier. I also know an operator who had already intentionally short-circuited it on their own; it only came up *after* I put the patch up.

Jaison, any opinion here?

Comment 8 melanie witt 2021-12-10 20:27:42 UTC
(In reply to Julia Kreger from comment #7)
> (In reply to melanie witt from comment #6)
> 
> > 
> > If that patch will solve the issue in this rhbz, we can remove FutureFeature
> > from this and address it as a bug fix backport.
> > 
> 
> I do think so, given the timing in the notes indicated the vast majority of
> the time spent initializing was vif plugging, and trying to do parallel
> threaded cache creation just seems like a nightmare risk wise.
> 
> I also know of a couple operators who have expressed intent to, or already
> have pulled that patch down into their internal forks/backport branches, so
> I *suspect* it does the needful and will make folks happier. I also know an
> operator who already intentionally short circuited it on their own, it only
> came up *after* I put the patch up.
> 
> Jason, any opinion here?

To be clear, I have proposed backports for the patch upstream as I think the linked patch is an important fix for general operability.

What I mean to ask here is whether that patch will solve this rhbz (if so, we don't need it to be an RFE and it is a bug). If it won't solve the issue in this rhbz, I will open a new rhbz to use for the backport activity and leave this one as an RFE for an enhancement that is required beyond the bug fix.

Comment 10 Jaison Raju 2022-04-11 17:47:56 UTC
(In reply to Julia Kreger from comment #7)
> (In reply to melanie witt from comment #6)
> 
> > 
> > If that patch will solve the issue in this rhbz, we can remove FutureFeature
> > from this and address it as a bug fix backport.
> > 
> 
> I do think so, given the timing in the notes indicated the vast majority of
> the time spent initializing was vif plugging, and trying to do parallel
> threaded cache creation just seems like a nightmare risk wise.
> 
> I also know of a couple operators who have expressed intent to, or already
> have pulled that patch down into their internal forks/backport branches, so
> I *suspect* it does the needful and will make folks happier. I also know an
> operator who already intentionally short circuited it on their own, it only
> came up *after* I put the patch up.
> 
> Jason, any opinion here?

Yes, I agree. Thanks.
Parallelization may help on heavily loaded computes with a large number of VMs, although we have not tested that.
I will try to check this on computes with ~100 VMs and raise a separate bug if parallelization is really required in general.

Comment 11 Artom Lifshitz 2022-04-21 15:55:40 UTC
Going with my usual "better to ask for forgiveness than for permission" approach, I'm going to remove the FutureFeature keyword, as it looks like the bug-fix approach proposed by Melanie in comment #8 was accepted, and a separate BZ will be raised if the parallelization originally requested here is still needed.

Comment 14 melanie witt 2022-05-12 21:49:46 UTC
This bug was originally opened against OSP16.1, so I'm changing the flags and target milestone to 16.1 to make it easier for the bug reporter to track the version they're interested in.

Comment 23 errata-xmlrpc 2022-12-07 20:24:45 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795