Bug 1867124
Summary: | [OSP13] Error in update_available_resources in nova compute resource tracker on sriov compute node | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Randy Rubins <rrubins> | |
Component: | openstack-nova | Assignee: | melanie witt <mwitt> | |
Status: | CLOSED ERRATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute> | |
Severity: | high | Docs Contact: | ||
Priority: | medium | |||
Version: | 13.0 (Queens) | CC: | dasmith, eglynn, jhakimra, kchamart, mwitt, sbauza, sgordon, stephenfin, vromanso | |
Target Milestone: | z14 | Keywords: | Triaged, ZStream | |
Target Release: | 13.0 (Queens) | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Fixed In Version: | openstack-nova-17.0.13-30.el7ost | Doc Type: | If docs needed, set a value | |
: | 1894277 | Environment: | |
Last Closed: | 2020-12-16 13:57:23 UTC | Type: | Bug | |
Bug Depends On: | 1894277 | |||
Description
Randy Rubins
2020-08-07 12:01:48 UTC
So far, I'm not seeing how this could possibly be a race between update_available_resource and init_host. TBH, from code inspection I can't see how it could possibly happen at all.

The update_available_resource method is called as part of a pre_start_hook [1] which runs before nova-compute fully comes up and starts serving requests. And the compute node record is fetched and/or created in update_available_resource [2][3] before the PCI tracker is constructed (in _setup_pci_tracker) and before PCI devices are created/saved in the _update method [4]. And the error trace shows the exception being raised from the _update method:

  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7447, in update_available_resource_for_node
    rt.update_available_resource(context, nodename)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 706, in update_available_resource
    self._update_available_resource(context, resources)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 782, in _update_available_resource
    self._update(context, cn)
  File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 926, in _update

For this error to happen, the PCI tracker would have had to have been constructed *before* update_available_resource ran.

We will need more information to investigate further. Could you please attach the entire nova-compute.log containing the error to the case, preferably at DEBUG log level?
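To make the call ordering described in the comment above easier to follow, here is a minimal Python sketch of it. This is not nova's code: the class SketchResourceTracker, the calls list, and the dictionary stand-ins for the compute node record and the PCI tracker are all hypothetical, made up for illustration. Only the method names update_available_resource, _setup_pci_tracker, _update and the term pre_start_hook come from the comment; _fetch_or_create_compute_node is an invented name for the "fetched and/or created" step.

```python
# Hypothetical, simplified model of the ordering described in the comment.
# These classes and dicts are illustration only, not nova's implementation.


class SketchResourceTracker:
    def __init__(self):
        self.compute_node = None  # stand-in for the compute node record
        self.pci_tracker = None   # stand-in for the PCI tracker
        self.calls = []           # records the order in which steps run

    def _fetch_or_create_compute_node(self, nodename):
        # Step 1: the compute node record is fetched and/or created first.
        self.calls.append("fetch_or_create_compute_node")
        self.compute_node = {"id": 1, "nodename": nodename}

    def _setup_pci_tracker(self):
        # Step 2: the PCI tracker is constructed only after the record exists.
        self.calls.append("setup_pci_tracker")
        self.pci_tracker = {"node_id": self.compute_node["id"], "devices": []}

    def _update(self):
        # Step 3: PCI devices are created/saved last.
        self.calls.append("update")

    def update_available_resource(self, nodename):
        self._fetch_or_create_compute_node(nodename)
        self._setup_pci_tracker()
        self._update()


def pre_start_hook(tracker, nodename):
    # Runs before the service starts serving requests.
    tracker.update_available_resource(nodename)


if __name__ == "__main__":
    tracker = SketchResourceTracker()
    pre_start_hook(tracker, "compute-0")
    print(tracker.calls)
```

In this model the PCI tracker is always None until update_available_resource has run, which is the ordering the comment expects from code inspection.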
[1] https://github.com/openstack/nova/blob/8f7dc3d8700a37956abf567d170ad42863baa4d7/nova/compute/manager.py#L1323-L1329
[2] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L740-L744
[3] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L539-L591
[4] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L795-L796

Hi Melanie! I've uploaded the full nova-compute logs (with debug enabled, in JSON format) to the case. There's also a nova db dump attached to the case if that might be useful. Please let me know if there's anything else you'd like me to gather from the environment.

As mentioned before, the affected sriov nodes get into this state seemingly at random (although not very frequently), and this has occurred in several different environments. But it's always the sriov compute node that gets affected. Also, we know that a restart of nova-compute does resolve this particular issue.

The issue has been reproduced, and the associated RH case #02722456 has been updated with nova-compute/libvirt logs and a nova db dump. There are some interesting logs in between the 60-sec resource tracker runs, where it switches from a successful compute update to a failing one. This issue was encountered while removing 20 sriov VMs from the 2 sriov nodes, including the one that encountered the error.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (openstack-nova bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5578