Bug 1982996
| Summary: | Ignore plug_vifs on the ironic driver | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jaison Raju <jraju> |
| Component: | openstack-nova | Assignee: | melanie witt <mwitt> |
| Status: | CLOSED ERRATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16.1 (Train) | CC: | alifshit, bdobreli, dasmith, eglynn, jhakimra, jkreger, jraju, kchamart, mwitt, sbauza, sgordon, vromanso |
| Target Milestone: | z9 | Keywords: | Patch, Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-nova-20.4.1-1.20220524123356.1ee93b9.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2085272 (view as bug list) | Environment: | |
| Last Closed: | 2022-12-07 20:24:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2085275 | | |
| Bug Blocks: | | | |
From my understanding, init_host calls _init_instance on each instance. It doesn't look like this is multi-threaded unless I'm missing something. The conflict is somewhat expected because _init_instance [1] tries to attach all VIFs; there's no pre-validation, we just trap the conflict exception [2]. I don't see anything in place to prevent, throttle, cache or multithread these useless calls, but I suspect the ironic calls are the ones slowing this down.

We will discuss this on the Wednesday triage call, but I'm adding Mel to the needinfo in preparation, since she helped on the previous issue. In the meantime Jaison, if you want to comment out the lines in [2] in your environment and see if nova-compute is a lot faster, that would help us narrow down the source of the slowness.

Thanks, DVD

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1136-L1150
[2] https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L1520-L1522

Commenting out those lines ends with service startup failures.
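To make the startup path described in [1] and [2] concrete, here is a minimal sketch of the pattern (the function and exception names below are stand-ins for illustration, not the real nova/ironicclient API): every restart re-attaches every VIF of every instance and simply swallows the resulting 409 conflict, so each instance still pays the cost of the Ironic API round trips.

```python
# Hypothetical illustration of the startup VIF plug pattern described above;
# attach_vif() and VifAlreadyAttached are placeholders, not the real API.


class VifAlreadyAttached(Exception):
    """Stand-in for the HTTP 409 conflict returned by Ironic."""


def attach_vif(node_uuid, vif_id):
    # Stand-in for the Ironic "attach VIF to node" call; on a service restart
    # the VIF is already attached, so Ironic answers with a 409 conflict.
    raise VifAlreadyAttached(f"VIF {vif_id} is already attached")


def plug_vifs(node_uuid, network_info):
    for vif in network_info:
        try:
            # One Ironic API round trip per VIF (~4 s each in the reported env).
            attach_vif(node_uuid, vif['id'])
        except VifAlreadyAttached:
            # The conflict is trapped and ignored, so startup "succeeds" but
            # the useless call is still made for every instance, sequentially.
            pass
```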
However, if I skip plug_vifs, the time taken for _get_power_state & _plug_vifs drops from ~45 minutes to ~2 minutes.
```
()[root@e16-h12-b01-fc640 /]# diff /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py-bak /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py
1683c1683
< self._plug_vifs(node, instance, network_info)
---
> # self._plug_vifs(node, instance, network_info)
```
Can we include a feature in nova for a configuration option that disables this for ironic nodes only during service start? That way, during nova instance builds on ironic nodes, this configuration won't change the behavior of _plug_vifs/attach_interface.
I am not sure if this will impact any functionality in the ironic case.
If we could get this, then we could have tripleo-common enable this configuration in nova for director-based deployments.
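A minimal sketch of the kind of config-gated skip being requested here (the option name, group, and wiring are illustrative assumptions, not the actual upstream change): the gate would only apply on the init_host/_init_instance path, leaving normal instance builds and attach_interface untouched.

```python
# Hypothetical oslo.config option gating VIF re-plugging at startup only;
# the option name and group are illustrative assumptions.
from oslo_config import cfg

startup_opts = [
    cfg.BoolOpt('skip_vif_plug_on_init',
                default=False,
                help='If true, do not re-plug VIFs for existing ironic '
                     'instances while nova-compute is starting up. Normal '
                     'instance builds and attach_interface are unaffected.'),
]

CONF = cfg.CONF
CONF.register_opts(startup_opts, group='ironic')


def plug_vifs_on_startup(driver, node, instance, network_info):
    """Called only from the init_host/_init_instance path."""
    if CONF.ironic.skip_vif_plug_on_init:
        # The VIFs were already attached when the instance was built, so the
        # per-instance Ironic round trips can be skipped on restart.
        return
    driver._plug_vifs(node, instance, network_info)
```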
Entirely unrelated but also related oddly, I was sick and tired of not being able to schedule breakfast on the weekends with one of my best friends who operates a multi-thousand node baremetal cluster fronted by Nova, because they have to restart their nova-compute services almost every weekend. So I put up a patch upstream to disable the plug_vifs action as it originates from startup. I'm unsure if we could get it backported, and that is likely best answered by the compute team. Adding link for the upstream change set.

Apologies, I somehow completely missed this needinfo for me :(

(In reply to Julia Kreger from comment #5)
> Entirely unrelated but also related oddly, I was sick and tired of not being
> able to schedule breakfast on the weekends with one of my best friends who
> operates a multi-thousand node baremetal cluster fronted by Nova, because
> they have to restart their nova-compute services almost every weekend. So I
> put up a patch upstream to disable the plug_vifs action as it originates
> from startup. I'm unsure if we could get it backported, and that is likely
> best answered by the compute team. Adding link for the upstream change set.

I just had a look at the linked patch [1] and it looks backportable to me. If that patch will solve the issue in this rhbz, we can remove FutureFeature from this and address it as a bug fix backport.

[1] https://review.opendev.org/c/openstack/nova/+/813263

(In reply to melanie witt from comment #6)
> If that patch will solve the issue in this rhbz, we can remove FutureFeature
> from this and address it as a bug fix backport.

I do think so, given the timing in the notes indicated the vast majority of the time spent initializing was vif plugging, and trying to do parallel threaded cache creation just seems like a nightmare risk wise.

I also know of a couple operators who have expressed intent to, or already have pulled that patch down into their internal forks/backport branches, so I *suspect* it does the needful and will make folks happier. I also know an operator who already intentionally short circuited it on their own, it only came up *after* I put the patch up.

Jason, any opinion here?

(In reply to Julia Kreger from comment #7)
> (In reply to melanie witt from comment #6)
> > If that patch will solve the issue in this rhbz, we can remove FutureFeature
> > from this and address it as a bug fix backport.
>
> I do think so, given the timing in the notes indicated the vast majority of
> the time spent initializing was vif plugging, and trying to do parallel
> threaded cache creation just seems like a nightmare risk wise.
>
> Jason, any opinion here?

To be clear, I have proposed backports for the patch upstream as I think the linked patch is an important fix for general operability. What I mean to ask here is whether that patch will solve this rhbz (if so, we don't need it to be an RFE and it is a bug). If it won't solve the issue in this rhbz, I will open a new rhbz to use for the backport activity and leave this one as an RFE for an enhancement that is required beyond the bug fix.
(In reply to Julia Kreger from comment #7)
> I do think so, given the timing in the notes indicated the vast majority of
> the time spent initializing was vif plugging, and trying to do parallel
> threaded cache creation just seems like a nightmare risk wise.
>
> Jason, any opinion here?

Yes, I agree. Thanks. Parallelization may help on heavy computes with a large number of VMs, although we have not tested it. I will try to check this on computes with 100 VMs and raise a separate bug if parallelization is really required in general.

Going with my usual "better to ask for forgiveness than for permission" approach, I'm going to remove the FutureFeature keyword, as it looks like the bugfix approach proposed by Melanie in comment #8 was accepted, and a separate BZ will be raised if we still need the original parallelization that was requested here.

This bug was originally opened against OSP 16.1, so I'm changing the flags and target milestone to 16.1 to make it easier for the bug reporter to track the version they're interested in.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795
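For reference, a rough sketch of the parallelization idea mentioned above (untested, as noted in the thread): nova-compute runs under eventlet, so a green thread pool is the assumed primitive; the function name and pool size here are illustrative, not the nova implementation.

```python
# Hypothetical sketch of initializing instances concurrently at startup;
# init_instance is a placeholder for the per-instance startup work.
import eventlet


def init_instances_parallel(init_instance, context, instances, pool_size=10):
    pool = eventlet.GreenPool(size=pool_size)
    for instance in instances:
        # Each instance's power-state check / VIF plug runs in its own
        # green thread instead of strictly sequentially.
        pool.spawn_n(init_instance, context, instance)
    pool.waitall()
```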
Description of problem:
On an 800 overcloud node scale environment, a nova-compute restart causes the nova-compute service to remain down for over 1 hour. By observation, I understand that nova.compute.manager performs _get_power_state & the VIF plug for each nova instance (overcloud node) sequentially. This ends up taking a very long time, and in the meantime nova-compute does not send a heartbeat. It's observed that _plug_vifs itself takes ~4 sec and it always fails with a 409 conflict. I am wondering if this is necessary in an openstack director environment:

...
Error contacting Ironic server: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409)
...

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210604.n.0 / 16.1_20210602.1

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
After a nova_compute restart, the service is seen in a down state for 60+ minutes.

Expected results:
The nova_compute binary should be up after a service restart within a permissive delay, e.g. 1 min.

Additional info:
```
2021-07-16 08:20:06.475 7 INFO nova.service [-] Starting compute node (version 20.4.1-1.20210409123502.1ee93b9.el8ost)
2021-07-16 08:20:06.520 7 DEBUG nova.servicegroup.drivers.db [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Seems service nova-compute on host e16-h12-b01-fc640.localdomain is down. Last heartbeat was 2021-07-16 06:38:39. Elapsed time is 6087.520855 is_up /usr/lib/python3.6/site-packages/nova/servicegroup/drivers/db.py:80
2021-07-16 08:20:06.521 7 DEBUG nova.virt.ironic.driver [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Hash ring members are {'e16-h12-b01-fc640.localdomain'} _refresh_hash_ring /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:795
2021-07-16 08:20:08.568 7 DEBUG nova.compute.manager [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] [instance: 09c1a9c6-3e31-422c-9997-25d295213318] Checking state _get_power_state /usr/lib/python3.6/site-packages/nova/compute/manager.py:1505
2021-07-16 08:20:08.979 7 DEBUG nova.virt.ironic.driver [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] plug: instance_uuid=09c1a9c6-3e31-422c-9997-25d295213318 vif=[{"id": "a4103080-688a-47d9-8be7-0dd13920cb24", "address": "0c:c4:7a:19:6c:45", "network": {"id": "9daae6d4-7867-409d-8d9d-86e5b37631b4", "bridge": null, "label": "ctlplane", "subnets": [{"cidr": "192.168.0.0/16", "dns": [], "gateway": {"address": "192.168.0.1", "type": "gateway", "version": 4, "meta": {}}, "ips": [{"address": "192.168.1.108", "type": "fixed", "version": 4, "meta": {}, "floating_ips": []}], "routes": [], "version": 4, "meta": {"dhcp_server": "192.168.0.5"}}], "meta": {"injected": false, "tenant_id": "fb14d195d53740629d98c24d8e45944f", "mtu": 9000, "physical_network": "ctlplane", "tunneled": false}}, "type": "other", "details": {"connectivity": "legacy"}, "devname": "tapa4103080-68", "ovs_interfaceid": null, "qbh_params": null, "qbg_params": null, "active": true, "vnic_type": "baremetal", "profile": {}, "preserve_on_delete": true, "meta": {}}] _plug_vifs /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:1650
2021-07-16 08:20:09.284 7 DEBUG ironicclient.common.http [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Negotiated API version is 1.46 negotiate_version /usr/lib/python3.6/site-packages/ironicclient/common/http.py:244
2021-07-16 08:20:11.094 7 DEBUG ironicclient.common.http [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Error contacting Ironic server: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409). Attempt 1 of 2 wrapper /usr/lib/python3.6/site-packages/ironicclient/common/http.py:300
2021-07-16 08:20:12.937 7 ERROR ironicclient.common.http [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] Error contacting Ironic server: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409). Attempt 2 of 2: ironicclient.common.apiclient.exceptions.Conflict: Unable to attach VIF because VIF a4103080-688a-47d9-8be7-0dd13920cb24 is already attached to Ironic Port 3d512ff2-b02d-4182-aab7-4a4e3242212c (HTTP 409)
2021-07-16 08:20:12.938 7 DEBUG nova.compute.manager [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] [instance: 09c1a9c6-3e31-422c-9997-25d295213318] Checking state _get_power_state /usr/lib/python3.6/site-packages/nova/compute/manager.py:1505
2021-07-16 08:20:12.972 7 DEBUG nova.compute.manager [req-93f8fc0b-35ad-47ba-a8a3-a5c46dda5a8d - - - - -] [instance: 09c1a9c6-3e31-422c-9997-25d295213318] Current state is 1, state in DB is 1. _init_instance /usr/lib/python3.6/site-packages/nova/compute/manager.py:1143
```
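As a back-of-envelope check of the description above (assuming roughly 4-5 seconds of sequential work per instance, per the figures in this report), ~800 overcloud nodes account for the observed hour of startup downtime:

```python
# Rough estimate using the figures from this report (assumed, not measured here).
nodes = 800
seconds_per_instance = 4.5  # ~4 s for _plug_vifs plus the power-state check
print(f"{nodes * seconds_per_instance / 60:.0f} minutes")  # -> 60 minutes
```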