1867124 – [OSP13] Error in update_available_resources in nova compute resource tracker on sriov compute node

Bug 1867124 - [OSP13] Error in update_available_resources in nova compute resource tracker on sriov compute node

Summary: [OSP13] Error in update_available_resources in nova compute resource tracker ...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	13.0 (Queens)
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	z14
Target Release:	13.0 (Queens)
Assignee:	melanie witt
QA Contact:	OSP DFG:Compute
Docs Contact:
URL:
Whiteboard:
Depends On:	1894277
Blocks:
TreeView+	depends on / blocked

Reported:	2020-08-07 12:01 UTC by Randy Rubins
Modified:	2023-03-21 19:34 UTC (History)
CC List:	9 users (show)
Fixed In Version:	openstack-nova-17.0.13-30.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1894277 (view as bug list)
Environment:
Last Closed:	2020-12-16 13:57:23 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1899541	None	None	None	2020-10-12 22:49:30 UTC
OpenStack gerrit	757656	None	MERGED	Prevent archiving of pci_devices records because of 'instance_uuid'	2020-12-09 18:47:11 UTC
Red Hat Product Errata	RHBA-2020:5578	None	None	None	2020-12-16 13:57:34 UTC

Description Randy Rubins 2020-08-07 12:01:48 UTC

Description of problem:
In customer's OSP13 environment, an sriov compute node after a random amount of time starts erroring on function "nova.compute.resource_tracker._update_available_resource".  Possible race condition between resource tracker and init_host. Requires further investigation.

Version-Release number of selected component (if applicable):
OSP13, openstack-nova-compute 13.0-98

How reproducible:
After successful start and operation of nova-compute, at some point the error updating resources is introduced with traceback listed below.

This error is persistent and can only be resolved by restarting openstack-nova-compute container.

Steps to Reproduce:
Start nova-compute on sriov compute node.
After some time observe the error.

Actual results:
  "traceback": [
    "Traceback (most recent call last):",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
    "    rt.update_available_resource(context, nodename)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
    "    self._update_available_resource(context, resources)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
    "    return f(*args, **kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
    "    self._update(context, cn)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
    "    self.pci_tracker.save(context)",
    "  File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
    "    dev.save()",
    "  File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
    "    ctxt, self, fn.__name__, args, kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
    "    objmethod=objmethod, args=args, kwargs=kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
    "    retry=self.retry)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
    "    timeout=timeout, retry=retry)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
    "    retry=retry)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
    "    raise result",
    "RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",


Expected results:
The 60s periodic resource tracker update of available resources on the sriov compute node should continue unimpeded.

Additional info:
This OSP13 environment has modified deployment architecture (k8s instead of tripleo).

Comment 2 melanie witt 2020-08-18 01:02:40 UTC

So far, I'm not seeing how this could possibly be a race between update_available_resource and init_host. TBH, from code inspection I can't see how it could possibly happen at all.

The update_available_resource method is called as part of a pre_start_hook [1] which runs before nova-compute fully comes up and starts serving requests. And the compute node record is fetched and/or created in update_available_resource [2][3] before the PCI tracker is constructed (in _setup_pci_tracker) and before PCI devices are created/saved in the _update method [4]. 

And the error trace shows the exception being raised from the _update method:

    "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
    "    rt.update_available_resource(context, nodename)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
    "    self._update_available_resource(context, resources)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
    "    return f(*args, **kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
    "    self._update(context, cn)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",

For this error to happen, the PCI tracker would have had to have been constructed *before* update_available_resource ran.

We will need more information to investigate further. Could you please attach the entire nova-compute.log containing the error to the case, preferably at DEBUG log level?

[1] https://github.com/openstack/nova/blob/8f7dc3d8700a37956abf567d170ad42863baa4d7/nova/compute/manager.py#L1323-L1329
[2] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L740-L744
[3] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L539-L591
[4] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L795-L796

Comment 3 Randy Rubins 2020-08-18 11:47:06 UTC

Hi Melanie!
I've uploaded the full nova-compute logs (with debug enabled, in json format) to the case.  There's also a nova db dump attached to the case if that might be useful.  Please let me know if there's anything else you'd like me to gather from the environment.
As mentioned before, the affected sriov nodes get into this state seemingly at random (although not very frequently) and this has occurred in several different environments.  But it's always the sriov compute node that gets affected.  Also, we know that a restart of nova-compute does resolve this particular issue.

Comment 10 Randy Rubins 2020-09-14 11:20:11 UTC

The issue has been reproduced the associated RH case #02722456 has been updated with nova-compute/libvirt logs and nova db dump.
There are some interesting logs in between the 60-sec resource tracker runs where it switches from successful compute update to an error one.  This issue was encountered while removing 20 sriov vms from the 2 sriov nodes, including the one that encountered the error.

Comment 27 errata-xmlrpc 2020-12-16 13:57:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openstack-nova bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5578

Note You need to log in before you can comment on or make changes to this bug.