Bug 1889723

| Summary: | [FFU OSP13 TO 16.1] Not able to spawn instances: "Failed to synchronize the placement service with resource provider information supplied by the compute host" | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Ravi Singh <ravsingh> |
| Component: | openstack-nova | Assignee: | melanie witt <mwitt> |
| Status: | CLOSED ERRATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16.1 (Train) | CC: | dasmith, eglynn, jhakimra, kchamart, mwitt, sbauza, sgordon, vromanso |
| Target Milestone: | z4 | Keywords: | Triaged, ZStream |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-nova-20.4.1-1.20201114041747.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1904281 (view as bug list) | Environment: | |
| Last Closed: | 2021-03-17 15:33:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1904281 | | |
| Bug Blocks: | | | |
Description (Ravi Singh, 2020-10-20 13:07:41 UTC)
Let me know if any logs are required.

(In reply to Ravi Singh from comment #0)
> Description of problem:
>
> Post FFU [OSP13 to 16.1], we are neither able to spawn new instances nor
> to restart old instances.
>
> 1. For new instances, we get this error in the nova-conductor log, which
> matches what is observed in nova-compute too [1].
>
> 2. Galera & RabbitMQ are in good condition.
>
> ~~~
> Full List of Resources:
>   * Container bundle set: galera-bundle [cluster.common.tag/openstack-mariadb:pcmklatest]:
>     * galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
>     * galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-controller-1
>     * galera-bundle-2 (ocf::heartbeat:galera): Master overcloud-controller-2
>   * Container bundle set: rabbitmq-bundle [cluster.common.tag/openstack-rabbitmq:pcmklatest]:
>     * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
>     * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-1
>     * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
> ~~~
>
> 3. All nova containers are in healthy condition.
> 4. The nova-compute service is UP.
> 5. MySQL-related errors were observed but did not reappear after the
> containers were restarted.
>
> ~~~
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     result.read()
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1349, in read
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     first_packet = self.connection._read_packet()
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 991, in _read_packet
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     packet_header = self._read_bytes(4)
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1037, in _read_bytes
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines [SQL: SELECT 1]
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines (Background on this error at: http://sqlalche.me/e/e3q8)
> ~~~

As you have mentioned, this error ^ indicates a loss of connection to the database.

> [1]
> ~~~
> ...
> nova-compute
>
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager nova.exception.ResourceProviderUpdateFailed: Failed to update resource provider via URL /resource_providers/5812a7ab-f8e7-4c57-aba8-f283fedfd04b/inventories: {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n JSON does not validate: 0 is less than the minimum of 1  Failed validating 'minimum' in schema['properties']['inventories']['patternProperties']['^[A-Z0-9_]+$']['properties']['total']: {'maximum': 2147483647, 'minimum': 1, 'type': 'integer'}  On instance['inventories']['DISK_GB']['total']: 0", "code": "placement.undefined_code", "request_id": "req-7fc88053-8e43-4c18-9810-73ad05c3a993"}]}

This ResourceProviderUpdateFailed error ^ says that an attempt to update inventories on resource provider (compute node) 5812a7ab-f8e7-4c57-aba8-f283fedfd04b failed because it tried to set the 'total' of resource 'DISK_GB' to 0, while the request schema requires a minimum value of 1.
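For illustration, the 400 can be reproduced outside of nova with the jsonschema library. The schema below is a minimal approximation of the fragment placement quotes in the error message above, not placement's actual code:

~~~python
# Minimal standalone repro of the 400 above; INVENTORY_SCHEMA approximates
# the schema fragment quoted in the placement error, it is not nova or
# placement code.
import jsonschema

INVENTORY_SCHEMA = {
    "type": "object",
    "properties": {
        "inventories": {
            "type": "object",
            "patternProperties": {
                "^[A-Z0-9_]+$": {
                    "type": "object",
                    "properties": {
                        "total": {
                            "type": "integer",
                            "minimum": 1,
                            "maximum": 2147483647,
                        },
                    },
                    "required": ["total"],
                },
            },
        },
    },
}

# What the libvirt driver effectively reported for this compute node.
payload = {"inventories": {"DISK_GB": {"total": 0}}}

try:
    jsonschema.validate(payload, INVENTORY_SCHEMA)
except jsonschema.ValidationError as exc:
    print(exc.message)  # -> 0 is less than the minimum of 1
~~~

The quoted nova-compute trace continues below with the fallout of that 400.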
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager During handling of the above exception, another exception occurred:
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager Traceback (most recent call last):
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 8740, in _update_available_resource_for_node
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     startup=startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 887, in update_available_resource
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update_available_resource(context, resources, startup=startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return f(*args, **kwargs)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 972, in _update_available_resource
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update(context, cn, startup=startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1237, in _update
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update_to_placement(context, compute_node, startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return Retrying(*dargs, **dkw).call(f, *args, **kw)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 223, in call
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return attempt.get(self._wrap_exception)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     six.reraise(self.value[0], self.value[1], self.value[2])
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     raise value
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1215, in _update_to_placement
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     allocations=allocs)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1421, in update_from_provider_tree
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self.set_traits_for_provider(context, pd.uuid, pd.traits)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self.gen.throw(type, value, traceback)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1341, in catch_all
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     raise exception.ResourceProviderSyncFailed()
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager nova.exception.ResourceProviderSyncFailed: Failed to synchronize the placement service with resource provider information supplied by the compute host.
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
> ~~~

And while handling ResourceProviderUpdateFailed, the catch_all context manager [1] clears the provider cache and then raises ResourceProviderSyncFailed.

The bug is that the libvirt driver reports a 'total' of 0 for 'DISK_GB' when the placement API does not allow that. This bug also exists on the master branch. Instead, for every resource we need to check whether more than 0 is available and report the resource only if that is the case. Otherwise, we should not include it in the PUT request at all; since PUT does a full replace [2], the resource inventory is then removed from placement, which is the way placement is intended to work. We are already doing this for other resource classes, like 'VCPU' [3].
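A minimal sketch of that approach (the helper name and dict shapes are illustrative, not the actual nova patch): drop any resource class whose total is not positive before building the PUT body, and placement's full-replace semantics remove that class's inventory for the provider.

~~~python
# Illustrative sketch of the fix described above; build_inventories and its
# input dict are hypothetical, not nova code. A resource class with a
# non-positive total is simply omitted from the PUT body, so the full-replace
# semantics of the inventories PUT [2] delete its existing inventory record
# instead of the request failing schema validation with a 400.

def build_inventories(totals):
    """Map resource class -> inventory entry, skipping zero-total classes."""
    return {
        rc: {"total": total}
        for rc, total in totals.items()
        if total > 0  # placement requires 1 <= total <= 2147483647
    }

# E.g. a node whose local disk is not used for instances (shared storage).
print(build_inventories({"VCPU": 16, "MEMORY_MB": 32768, "DISK_GB": 0}))
# -> {'VCPU': {'total': 16}, 'MEMORY_MB': {'total': 32768}}
~~~

This mirrors how zero-capacity 'VCPU' inventory is already handled in the libvirt driver [3].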
"/usr/lib/python3.6/site-packages/six.py", line 693, in reraise > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager raise value > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager File > "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager attempt = > Attempt(fn(*args, **kwargs), attempt_number, False) > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager File > "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line > 1215, in _update_to_placement > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager allocations=allocs) > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager File > "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line > 1421, in update_from_provider_tree > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager > self.set_traits_for_provider(context, pd.uuid, pd.traits) > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager File > "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__ > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager > self.gen.throw(type, value, traceback) > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager File > "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line > 1341, in catch_all > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager raise > exception.ResourceProviderSyncFailed() > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager > nova.exception.ResourceProviderSyncFailed: Failed to synchronize the > placement service with resource provider information supplied by the compute > host. > 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager > ~~~ And while handling ResourceProviderUpdateFailed, the cache is cleared and then ResourceProviderSyncFailed is raised. The bug is that the libvirt driver is reporting 'total' for 'DISK_GB' as 0 when the placement API does not allow this. This bug also exists on the master branch. Instead, for all resources we need to check whether there is > 0 available and only report them if that is the case. Else, we should not include them in the PUT request, which will result in the resource inventory being removed from placement (as PUT does a full replace [2]), which is the way that placement is intended to work. We are already doing this for other resource classes like 'VCPU' [3]. [1] https://github.com/openstack/nova/blob/17a233cf77f811f278dfb4b64dd0b0d76ad4b719/nova/scheduler/client/report.py#L1322-L1341 [2] https://docs.openstack.org/api-ref/placement/?expanded=#update-resource-provider-inventory [3] https://github.com/openstack/nova/blob/6a5e158756edc6c01bb23c26dcfc4c3a3df703dc/nova/virt/libvirt/driver.py#L7718 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0817 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0817 |