Bug 1867124

Summary: [OSP13] Error in update_available_resources in nova compute resource tracker on sriov compute node
Product: Red Hat OpenStack
Reporter: Randy Rubins <rrubins>
Component: openstack-nova
Assignee: melanie witt <mwitt>
Status: CLOSED ERRATA
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Priority: medium
Version: 13.0 (Queens)
CC: dasmith, eglynn, jhakimra, kchamart, mwitt, sbauza, sgordon, stephenfin, vromanso
Target Milestone: z14
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-nova-17.0.13-30.el7ost
Doc Type: If docs needed, set a value
Clones: 1894277 (view as bug list)
Last Closed: 2020-12-16 13:57:23 UTC
Type: Bug
Bug Depends On: 1894277    

Description Randy Rubins 2020-08-07 12:01:48 UTC
Description of problem:
In the customer's OSP13 environment, an sriov compute node starts erroring in the function "nova.compute.resource_tracker._update_available_resource" after a random amount of time. A race condition between the resource tracker and init_host is suspected; this requires further investigation.

Version-Release number of selected component (if applicable):
OSP13, openstack-nova-compute 13.0-98

How reproducible:
After nova-compute starts and runs successfully, at some point the resource update begins failing with the traceback listed below.

The error is persistent and can only be resolved by restarting the openstack-nova-compute container.

Steps to Reproduce:
Start nova-compute on an sriov compute node.
After some time, observe the error.

Actual results:
  "traceback": [
    "Traceback (most recent call last):",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
    "    rt.update_available_resource(context, nodename)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
    "    self._update_available_resource(context, resources)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
    "    return f(*args, **kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
    "    self._update(context, cn)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
    "    self.pci_tracker.save(context)",
    "  File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
    "    dev.save()",
    "  File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
    "    ctxt, self, fn.__name__, args, kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
    "    objmethod=objmethod, args=args, kwargs=kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
    "    retry=self.retry)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
    "    timeout=timeout, retry=retry)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
    "    retry=retry)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
    "    raise result",
    "RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",


Expected results:
The 60-second periodic resource tracker update of available resources on the sriov compute node should continue unimpeded.

Additional info:
This OSP13 environment has a modified deployment architecture (k8s instead of tripleo).

Comment 2 melanie witt 2020-08-18 01:02:40 UTC
So far, I'm not seeing how this could possibly be a race between update_available_resource and init_host. TBH, from code inspection I can't see how it could possibly happen at all.

The update_available_resource method is called as part of a pre_start_hook [1] which runs before nova-compute fully comes up and starts serving requests. And the compute node record is fetched and/or created in update_available_resource [2][3] before the PCI tracker is constructed (in _setup_pci_tracker) and before PCI devices are created/saved in the _update method [4]. 

And the error trace shows the exception being raised from the _update method:

    "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
    "    rt.update_available_resource(context, nodename)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
    "    self._update_available_resource(context, resources)",
    "  File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
    "    return f(*args, **kwargs)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
    "    self._update(context, cn)",
    "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",

For this error to happen, the PCI tracker would have had to be constructed *before* update_available_resource ran.
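
To make the ordering argument concrete, here is a simplified paraphrase of the flow described above. It is illustrative only, not the actual nova code (see links [1]-[4] below for the real implementation), and the class and attribute names are stand-ins:

    # Illustrative sketch of the ordering described in this comment; the
    # class and attribute names are simplified stand-ins, not nova's.
    class SketchResourceTracker(object):
        def __init__(self):
            self.compute_node = None
            self.pci_tracker = None

        def update_available_resource(self, nodename):
            # 1. The compute node record is fetched or created first ...
            self.compute_node = {"id": 42, "host": nodename}
            # 2. ... then the PCI tracker is constructed for that node ...
            self._setup_pci_tracker()
            # 3. ... and only then does _update() save PCI devices.
            self._update()

        def _setup_pci_tracker(self):
            if self.pci_tracker is None:
                self.pci_tracker = {"node_id": self.compute_node["id"]}

        def _update(self):
            # With this ordering the saved devices always carry a node id; the
            # trace implies a PCI tracker that existed with no node id before
            # update_available_resource ran, which this flow should not allow.
            assert self.pci_tracker["node_id"] is not None

    SketchResourceTracker().update_available_resource("sriov-compute-0")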

We will need more information to investigate further. Could you please attach the entire nova-compute.log containing the error to the case, preferably at DEBUG log level?

[1] https://github.com/openstack/nova/blob/8f7dc3d8700a37956abf567d170ad42863baa4d7/nova/compute/manager.py#L1323-L1329
[2] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L740-L744
[3] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L539-L591
[4] https://github.com/openstack/nova/blob/stable/queens/nova/compute/resource_tracker.py#L795-L796

Comment 3 Randy Rubins 2020-08-18 11:47:06 UTC
Hi Melanie!
I've uploaded the full nova-compute logs (with debug enabled, in json format) to the case.  There's also a nova db dump attached to the case if that might be useful.  Please let me know if there's anything else you'd like me to gather from the environment.
As mentioned before, the affected sriov nodes get into this state seemingly at random (although not very frequently), and this has occurred in several different environments. But it's always the sriov compute node that gets affected. Also, we know that a restart of nova-compute does resolve this particular issue.

Comment 10 Randy Rubins 2020-09-14 11:20:11 UTC
The issue has been reproduced, and the associated RH case #02722456 has been updated with nova-compute/libvirt logs and a nova db dump.
There are some interesting log entries in between the 60-sec resource tracker runs, where it switches from a successful compute update to a failing one. This issue was encountered while removing 20 sriov VMs from the 2 sriov nodes, including the one that hit the error.
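
For digging through the attached JSON logs, a small filter along these lines could help isolate the window between the last successful resource tracker run and the first failing one. This is a hypothetical helper, not part of nova, and it assumes one JSON object per line with a "message" key; adjust the key names to the actual log schema:

    # Hypothetical log filter: print the entries leading up to the first
    # IntegrityError in a JSON-lines nova-compute log. The "message" key is
    # an assumption about the log format and may need adjusting.
    import json
    import sys

    WINDOW = 50  # how many preceding entries to show

    def main(path):
        recent = []
        with open(path) as fh:
            for line in fh:
                try:
                    entry = json.loads(line)
                except ValueError:
                    continue
                recent.append(entry)
                if "IntegrityError" in entry.get("message", ""):
                    for e in recent[-WINDOW:]:
                        print(e.get("message", ""))
                    return

    if __name__ == "__main__":
        main(sys.argv[1])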

Comment 27 errata-xmlrpc 2020-12-16 13:57:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openstack-nova bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5578