Description of problem:

Post FFU [OSP13 to 16.1], we are neither able to spawn new instances nor to restart old instances.

1. For new instances we get this error in the nova-conductor log, consistent with what is observed in nova-compute as well [1].

2. Galera & RabbitMQ are in good condition.

~~~
Full List of Resources:
  * Container bundle set: galera-bundle [cluster.common.tag/openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
    * galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-controller-1
    * galera-bundle-2 (ocf::heartbeat:galera): Master overcloud-controller-2
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-1
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
~~~

3. All nova containers are in a healthy condition.
4. The nova-compute service is UP.
5. MySQL-related errors were observed but did not re-appear after the containers were restarted:

~~~
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     result.read()
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1349, in read
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     first_packet = self.connection._read_packet()
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 991, in _read_packet
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     packet_header = self._read_bytes(4)
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1037, in _read_bytes
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines [SQL: SELECT 1]
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines (Background on this error at: http://sqlalche.me/e/e3q8)
2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines
~~~

[1]
~~~
/var/log/containers/nova/nova-conductor.log:2020-10-20 12:26:33.412 25 ERROR nova.scheduler.utils [req-881321df-b86e-4d8a-a3ed-263a4ce7590a 4c77c40906394a83b43160ad2d9ca766 93d38a65596945f3ae874746a050d8a5 - default default] [instance: 43d28fd4-3205-485d-a5be-5e082c30f579] Error from last host: overcloud-compute-1.localdomain (node overcloud-compute-1.localdomain): ['Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1329, in catch_all\n    yield\n', '  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1418, in update_from_provider_tree\n    context, pd.uuid, pd.inventory)\n', '  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 950, in set_inventory_for_provider\n    raise exception.ResourceProviderUpdateFailed(url=url, error=resp.text)\n', 'nova.exception.ResourceProviderUpdateFailed: Failed to update resource provider via URL /resource_providers/5812a7ab-f8e7-4c57-aba8-f283fedfd04b/inventories: {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\\n\\n JSON does not validate: 0 is less than the minimum of 1 Failed validating \'minimum\' in schema[\'properties\'][\'inventories\'][\'patternProperties\'][\'^[A-Z0-9_]+$\'][\'properties\'][\'total\']: {\'maximum\': 2147483647, \'minimum\': 1, \'type\': \'integer\'} On instance[\'inventories\'][\'DISK_GB\'][\'total\']: 0 ", "code": "placement.undefined_code", "request_id": "req-fba91085-f470-4c85-9183-29b1d7cae026"}]}\n', '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2407, in _build_and_run_instance\n    limits):\n', '  File "/usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner\n    return f(*args, **kwargs)\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 232, in instance_claim\n    self._update(elevated, cn)\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1237, in _update\n    self._update_to_placement(context, compute_node, startup)\n', '  File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f\n    return Retrying(*dargs, **dkw).call(f, *args, **kw)\n', '  File "/usr/lib/python3.6/site-packages/retrying.py", line 223, in call\n    return attempt.get(self._wrap_exception)\n', '  File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get\n    six.reraise(self.value[0], self.value[1], self.value[2])\n', '  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise\n    raise value\n', '  File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call\n    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1215, in _update_to_placement\n    allocations=allocs)\n', '  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1421, in update_from_provider_tree\n    self.set_traits_for_provider(context, pd.uuid, pd.traits)\n', '  File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__\n    self.gen.throw(type, value, traceback)\n', '  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1341, in catch_all\n    raise exception.ResourceProviderSyncFailed()\n', 'nova.exception.ResourceProviderSyncFailed: Failed to synchronize the placement service with resource provider information supplied by the compute host.\n', '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2161, in _do_build_and_run_instance\n    filter_properties, request_spec)\n', '  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 2537, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n', 'nova.exception.RescheduledException: Build of instance 43d28fd4-3205-485d-a5be-5e082c30f579 was re-scheduled: Failed to synchronize the placement service with resource provider information supplied by the compute host.\n']

nova-compute

2020-10-20 11:32:45.484 7 ERROR nova.compute.manager nova.exception.ResourceProviderUpdateFailed: Failed to update resource provider via URL /resource_providers/5812a7ab-f8e7-4c57-aba8-f283fedfd04b/inventories: {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n JSON does not validate: 0 is less than the minimum of 1 Failed validating 'minimum' in schema['properties']['inventories']['patternProperties']['^[A-Z0-9_]+$']['properties']['total']: {'maximum': 2147483647, 'minimum': 1, 'type': 'integer'} On instance['inventories']['DISK_GB']['total']: 0 ", "code": "placement.undefined_code", "request_id": "req-7fc88053-8e43-4c18-9810-73ad05c3a993"}]}
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager During handling of the above exception, another exception occurred:
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager Traceback (most recent call last):
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 8740, in _update_available_resource_for_node
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     startup=startup)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 887, in update_available_resource
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update_available_resource(context, resources, startup=startup)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return f(*args, **kwargs)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 972, in _update_available_resource
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update(context, cn, startup=startup)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1237, in _update
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update_to_placement(context, compute_node, startup)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return Retrying(*dargs, **dkw).call(f, *args, **kw)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 223, in call
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return attempt.get(self._wrap_exception)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     six.reraise(self.value[0], self.value[1], self.value[2])
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     raise value
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1215, in _update_to_placement
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     allocations=allocs)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1421, in update_from_provider_tree
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self.set_traits_for_provider(context, pd.uuid, pd.traits)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self.gen.throw(type, value, traceback)
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1341, in catch_all
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     raise exception.ResourceProviderSyncFailed()
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager nova.exception.ResourceProviderSyncFailed: Failed to synchronize the placement service with resource provider information supplied by the compute host.
2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
~~~

Version-Release number of selected component (if applicable):
OSP16.1

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
Not able to spawn new instances or restart old ones.

Expected results:
Should be able to spawn new instances and restart existing ones.

Additional info:
Let me know if any logs are required.
(In reply to Ravi Singh from comment #0)
> Description of problem:
>
> Post FFU[OSP13 to 16.1] neither able to spawn new instances nor able to restart old instances.
>
> 1. For new instance getting this error in the nova-conductor log which is in sync with something observed in nova-compute too[1]
>
> 2. Galera & rabbit are in good condition.
>
> ~~~
> Full List of Resources:
>   * Container bundle set: galera-bundle [cluster.common.tag/openstack-mariadb:pcmklatest]:
>     * galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
>     * galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-controller-1
>     * galera-bundle-2 (ocf::heartbeat:galera): Master overcloud-controller-2
>   * Container bundle set: rabbitmq-bundle [cluster.common.tag/openstack-rabbitmq:pcmklatest]:
>     * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
>     * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-1
>     * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
> ~~~
>
> 3. All nova containers are in healthy condition.
> 4. nova-compute service is UP.
> 5. MySQL related errors observed but didn't re-appear post containers restart/
>
> ~~~
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     result.read()
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1349, in read
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     first_packet = self.connection._read_packet()
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 991, in _read_packet
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     packet_header = self._read_bytes(4)
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1037, in _read_bytes
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines [SQL: SELECT 1]
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines (Background on this error at: http://sqlalche.me/e/e3q8)
> 2020-10-20 11:38:41.465 24 ERROR oslo_db.sqlalchemy.engines
> ~~~

As you have mentioned, this error ^ is indicating a loss of connection to the database.

> [1]
> ~~~
>
> ...
>
> nova-compute
>
>
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager nova.exception.ResourceProviderUpdateFailed: Failed to update resource provider via URL /resource_providers/5812a7ab-f8e7-4c57-aba8-f283fedfd04b/inventories: {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n JSON does not validate: 0 is less than the minimum of 1 Failed validating 'minimum' in schema['properties']['inventories']['patternProperties']['^[A-Z0-9_]+$']['properties']['total']: {'maximum': 2147483647, 'minimum': 1, 'type': 'integer'} On instance['inventories']['DISK_GB']['total']: 0 ", "code": "placement.undefined_code", "request_id": "req-7fc88053-8e43-4c18-9810-73ad05c3a993"}]}

This error ResourceProviderUpdateFailed ^ is saying that an attempt to update inventories on resource provider (compute node) 5812a7ab-f8e7-4c57-aba8-f283fedfd04b failed because of an attempt to set 'total' of resource 'DISK_GB' equal to 0. The request schema requires a minimum value of 1.

> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager During handling of the above exception, another exception occurred:
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager Traceback (most recent call last):
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 8740, in _update_available_resource_for_node
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     startup=startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 887, in update_available_resource
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update_available_resource(context, resources, startup=startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return f(*args, **kwargs)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 972, in _update_available_resource
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update(context, cn, startup=startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1237, in _update
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self._update_to_placement(context, compute_node, startup)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return Retrying(*dargs, **dkw).call(f, *args, **kw)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 223, in call
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     return attempt.get(self._wrap_exception)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     six.reraise(self.value[0], self.value[1], self.value[2])
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     raise value
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1215, in _update_to_placement
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     allocations=allocs)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1421, in update_from_provider_tree
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self.set_traits_for_provider(context, pd.uuid, pd.traits)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     self.gen.throw(type, value, traceback)
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager   File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 1341, in catch_all
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager     raise exception.ResourceProviderSyncFailed()
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager nova.exception.ResourceProviderSyncFailed: Failed to synchronize the placement service with resource provider information supplied by the compute host.
> 2020-10-20 11:32:45.484 7 ERROR nova.compute.manager
> ~~~

And while handling ResourceProviderUpdateFailed, the cache is cleared and then ResourceProviderSyncFailed is raised.

The bug is that the libvirt driver is reporting 'total' for 'DISK_GB' as 0, which the placement API does not allow. This bug also exists on the master branch. Instead, for every resource class we need to check whether there is > 0 available and only report it if that is the case. Otherwise, we should not include it in the PUT request, which results in that resource inventory being removed from placement (as PUT does a full replace [2]); this is the way placement is intended to work. We are already doing this for other resource classes like 'VCPU' [3].

[1] https://github.com/openstack/nova/blob/17a233cf77f811f278dfb4b64dd0b0d76ad4b719/nova/scheduler/client/report.py#L1322-L1341
[2] https://docs.openstack.org/api-ref/placement/?expanded=#update-resource-provider-inventory
[3] https://github.com/openstack/nova/blob/6a5e158756edc6c01bb23c26dcfc4c3a3df703dc/nova/virt/libvirt/driver.py#L7718
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0817