Description of problem:

Spawning VMs suddenly stops working in OSP14 + ODL. The VM goes into ERROR state with 'Failed to allocate the network(s) for instances' as the reason for the failure. This error is not seen in OSP13 jobs with networking-odl in the same setup.

In OSP14, the error below appears in neutron's server.log on controller-0 on every periodic task run:

2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task [req-fbd79a20-da08-4535-9ebd-18d033ae6667 - - - - -] Failed during periodic task operation journal_recovery.: InvalidRequestError: Object '<OpenDaylightJournal at 0x7f397e4b5cd0>' is already attached to session '133969' (this is '133978')
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task Traceback (most recent call last):
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/periodic_task.py", line 65, in _execute_op
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     operation(context)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 233, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return method(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 140, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     setattr(e, '_RETRY_EXCEEDED', True)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self.force_reraise()
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     six.reraise(self.type_, self.value, self.tb)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 136, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return f(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     ectxt.value = e.inner_exc
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self.force_reraise()
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     six.reraise(self.type_, self.value, self.tb)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return f(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 183, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     LOG.debug("Retry wrapper got retriable exception: %s", e)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self.force_reraise()
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     six.reraise(self.type_, self.value, self.tb)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 179, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return f(*dup_args, **dup_kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/recovery.py", line 53, in journal_recovery
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     _handle_non_existing_resource(context, row)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/recovery.py", line 108, in _handle_non_existing_resource
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     _sync_resource_to_odl(context, row, odl_const.ODL_CREATE, False)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/recovery.py", line 92, in _sync_resource_to_odl
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     journal.entry_complete(context, row)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 233, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return method(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 993, in wrapper
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return fn(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/journal.py", line 136, in entry_complete
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     db.delete_row(context, entry)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/db/db.py", line 120, in delete_row
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     context.session.delete(row)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1812, in delete
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self._delete_impl(state, instance, head=True)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1824, in _delete_impl
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     to_attach = self._before_attach(state, obj)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2179, in _before_attach
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     state.session_id, self.hash_key))
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task InvalidRequestError: Object '<OpenDaylightJournal at 0x7f397e4b5cd0>' is already attached to session '133969' (this is '133978')

There are no error messages in neutron's server.log on controller-1.

Errors in neutron's server.log on controller-2 at the time of instance creation:

2018-09-20 18:15:02.105 8867 INFO neutron.pecan_wsgi.hooks.translation [req-9dbc787c-73fe-42a9-88e4-17c89dd85c3e 8275e38d11b849149c3639748bbc0eea 204b91f241f446e9a99f2c1d2d23d298 - default default] GET failed (client error): The resource could not be found.

There are constantly appearing unhandled exceptions in karaf.log; they show up around instance creation time as well as throughout idle time (this exception was already reported as https://bugzilla.redhat.com/show_bug.cgi?id=1626497), e.g.:

[root@controller-0 opendaylight]# grep -i "2018-09-20T17:15.*exception" karaf.log.1
2018-09-20T17:15:21,559 | WARN | nioEventLoopGroup-11-2 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:21,577 | WARN | nioEventLoopGroup-11-8 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:27,601 | WARN | nioEventLoopGroup-9-3 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:31,625 | WARN | nioEventLoopGroup-9-6 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:33,644 | WARN | nioEventLoopGroup-11-13 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:49,688 | WARN | nioEventLoopGroup-9-11 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.

The network agents seem to be alive:

(overcloud) [stack@undercloud-0 ~]$ openstack network agent list
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| ID                                   | Agent Type     | Host                     | Availability Zone | Alive | State | Binary                       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| 045631e4-299b-4fab-9652-56fd86a99421 | ODL L2         | compute-1.localdomain    | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 100fbc1e-4a56-4c00-b5b6-52d5e72a00fd | Metadata agent | controller-1.localdomain | None              | :-)   | UP    | neutron-metadata-agent       |
| 1bad023f-9bad-4d1a-866d-4dd268452f81 | ODL L2         | compute-0.localdomain    | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 3440fd3d-274e-4368-96ea-7568912a20f1 | ODL L2         | controller-2.localdomain | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 37bec116-3af5-4bf1-8000-30461b544983 | ODL L2         | controller-1.localdomain | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 6554dee0-0435-48c0-9312-3854b0794f02 | ODL L2         | controller-0.localdomain | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 68283729-551b-432c-9707-b9b5c419ac07 | DHCP agent     | controller-0.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent           |
| 8f9c8889-7386-4a3a-9810-f690883e4d9c | Metadata agent | controller-0.localdomain | None              | :-)   | UP    | neutron-metadata-agent       |
| aeee37fe-e70d-457d-83f9-b6dae478c61f | DHCP agent     | controller-2.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent           |
| cf38b535-433b-4170-844a-288133a17f69 | DHCP agent     | controller-1.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent           |
| fd96f2d0-76c0-4236-95ec-c4a7c7dbde92 | Metadata agent | controller-2.localdomain | None              | :-)   | UP    | neutron-metadata-agent       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+

Version-Release number of selected component (if applicable):
osp14 (puddle:

How reproducible:
100%

Steps to Reproduce:
1. deploy OSP14 + ODL
2. create flavor, image, network, subnet, router
3. create an instance

Actual results:
The instance fails to spawn because network allocation fails.

Expected results:
The instance spawns and becomes ACTIVE.

Additional info:
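For reference, the InvalidRequestError in the traceback above is generic SQLAlchemy behaviour rather than anything ODL-specific: an ORM object attached to one Session cannot be deleted through another one. A minimal standalone sketch of that failure mode (the Journal model here is hypothetical, a stand-in for OpenDaylightJournal; it only mirrors the two-session pattern visible in recovery.py -> journal.entry_complete -> db.delete_row):

# minimal repro sketch, not networking-odl code (SQLAlchemy 1.4+ import paths)
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Journal(Base):  # hypothetical stand-in for OpenDaylightJournal
    __tablename__ = 'journal'
    id = Column(Integer, primary_key=True)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

s1 = Session(bind=engine)
row = Journal(id=1)
s1.add(row)
s1.commit()            # 'row' stays attached to s1 after the commit

s2 = Session(bind=engine)  # a second, concurrent session
s2.delete(row)             # raises sqlalchemy.exc.InvalidRequestError:
                           # Object '<Journal at 0x...>' is already
                           # attached to session '1' (this is '2')

i.e. the journal row handed to entry_complete() was fetched under a different session/transaction than the one that later tries to delete it.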
-2 for now - I need to check whether the CSIT suites that were running (and were restarting OpenDaylight) had something to do with this problem.
I was able to reproduce the issue on a fresh machine/deployment where CSIT did not run, only tempest. Tempest failed 3 scenario tests:
- test_server_connectivity_cold_migration[compute,id-a4858f6c-401e-4155-9a49-d5cd053d1a2f,network,slow]
- test_server_connectivity_cold_migration_revert[compute,id-25b188d7-0183-4b1e-a11d-15840c8e2fd6,network,slow]
- test_server_connectivity_resize[compute,id-719eb59d-2f42-4b66-b8b1-bb1254473967,network,slow]
(tempest results attached)

Also, I can't see the 'is already attached to session' periodic task error messages in the neutron server logs any more (checked all 3 controllers/neutrons).

I can see (only on 1 of 3 neutron servers) these "duplicate IpamAllocation" messages (harmless?? - a small sketch of the retry mechanism behind them is at the end of this comment):

[root@controller-2 neutron]# grep -Ei "error|trac|except" server.log | grep -i duplica
2018-09-21 09:46:43.685 10197 DEBUG neutron_lib.db.api [req-58ec2162-9f49-487c-911e-650e64c269cc - - - - -] Retry wrapper got retriable exception: Failed to create a duplicate IpamAllocation: for attribute(s) ['PRIMARY'] with value(s) 192.168.0.2-86d63683-7544-4d4a-a670-4c81715cc7a9 wrapped /usr/lib/python2.7/site-packages/neutron_lib/db/api.py:183
2018-09-21 09:46:45.011 10197 DEBUG neutron_lib.db.api [req-58ec2162-9f49-487c-911e-650e64c269cc - - - - -] Retry wrapper got retriable exception: Failed to create a duplicate IpamAllocation: for attribute(s) ['PRIMARY'] with value(s) 192.168.0.3-86d63683-7544-4d4a-a670-4c81715cc7a9 wrapped /usr/lib/python2.7/site-packages/neutron_lib/db/api.py:183

Details of the private openstack network on which the VM fails to spawn (fails to get the network allocated):

(overcloud) [stack@undercloud-0 ~]$ openstack network show workload_internal_net_0
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        | nova                                 |
| created_at                | 2018-09-21T08:46:33Z                 |
| description               |                                      |
| dns_domain                | None                                 |
| id                        | dd9e0f49-fa4f-4f64-a7e7-bc9de7984e5a |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | None                                 |
| is_vlan_transparent       | None                                 |
| mtu                       | 1450                                 |
| name                      | workload_internal_net_0              |
| port_security_enabled     | True                                 |
| project_id                | 9fe2db29bed14637a78690ac4c5a6027     |
| provider:network_type     | vxlan                                |
| provider:physical_network | None                                 |
| provider:segmentation_id  | 37                                   |
| qos_policy_id             | None                                 |
| revision_number           | 2                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | False                                |
| status                    | ACTIVE                               |
| subnets                   | fa96e871-62cd-4872-ad24-c75ffb41f1b1 |
| tags                      |                                      |
| updated_at                | 2018-09-21T08:46:40Z                 |
+---------------------------+--------------------------------------+
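On the 'duplicate IpamAllocation' lines: they are logged at DEBUG by the DB retry wrapper, which treats DBDuplicateEntry as a retriable error, so in isolation they should indeed be harmless - the allocation is simply retried. A self-contained toy of that mechanism built directly on oslo.db's wrap_db_retry (function name, attempt counter and addresses are made up; neutron actually goes through neutron_lib's retry decorator, which sits on top of the same machinery):

import itertools

from oslo_db import api as oslo_db_api
from oslo_db import exception as db_exc

_attempts = itertools.count()

# retry only when the failure is a duplicate-entry conflict
@oslo_db_api.wrap_db_retry(
    max_retries=3, retry_interval=0.1,
    exception_checker=lambda exc: isinstance(exc, db_exc.DBDuplicateEntry))
def allocate_address():
    if next(_attempts) == 0:
        # simulate losing the race for 192.168.0.2 to a concurrent request
        raise db_exc.DBDuplicateEntry(columns=['PRIMARY'])
    return '192.168.0.3'  # the retry picks the next free address

print(allocate_address())  # one simulated conflict, then success on retry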
Investigating a libvirt problem that is popping up on the compute nodes.
Please provide more info and the relevant logs
It seems that...

1. the network agents are dead:

(overcloud) [stack@undercloud-0 results]$ openstack network agent list
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| ID                                   | Agent Type     | Host                     | Availability Zone | Alive | State | Binary                       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| 16a69c16-0572-48b8-b00f-2db6bb42e26b | ODL L2         | controller-1.localdomain | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 352db67c-5d1f-43c0-aeb5-f844e81af981 | DHCP agent     | controller-0.localdomain | nova              | XXX   | UP    | neutron-dhcp-agent           |
| 498f5660-635a-49b1-8e82-8f64656977b8 | ODL L2         | controller-0.localdomain | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 4d6d6abd-99fc-4a5d-a9ac-edc1b430da70 | ODL L2         | compute-0.localdomain    | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 4e594765-53b0-45a3-85d0-368ba3dd2b52 | DHCP agent     | controller-2.localdomain | nova              | XXX   | UP    | neutron-dhcp-agent           |
| 54bac610-d2b5-4366-92cb-441876917ad0 | Metadata agent | controller-2.localdomain | None              | XXX   | UP    | neutron-metadata-agent       |
| 5b7e40b7-e97d-4a81-99c4-f1781c3ddbf8 | Metadata agent | controller-1.localdomain | None              | XXX   | UP    | neutron-metadata-agent       |
| 8664a7f7-fd09-4734-ab8c-6dde4a2696bc | ODL L2         | compute-1.localdomain    | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 96b8e4df-2dc1-462a-8c2d-8a2046d9184b | ODL L2         | controller-2.localdomain | None              | XXX   | UP    | neutron-odlagent-portbinding |
| a53eb97a-d93e-46ab-9d73-f7fe1db2fe7c | Metadata agent | controller-0.localdomain | None              | XXX   | UP    | neutron-metadata-agent       |
| c585e594-6731-4d91-a9b5-2c0b425c6632 | DHCP agent     | controller-1.localdomain | nova              | XXX   | UP    | neutron-dhcp-agent           |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+

that's possibly because...

2. the neutron_api docker container is unhealthy (and doesn't recover; a probe equivalent to this healthcheck is sketched at the end of this comment):

a16c65e9b0a2 192.168.24.1:8787/rhosp14/openstack-neutron-server-opendaylight:2018-09-26.1 "kolla_start" 43 hours ago Up 35 hours (unhealthy) neutron_api

"Health": {
    "Status": "unhealthy",
    "FailingStreak": 827,
    "Log": [
        {
            "Start": "2018-10-04T05:13:43.109357324-04:00",
            "End": "2018-10-04T05:13:53.409602401-04:00",
            "ExitCode": 1,
            "Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:06 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:08 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:09 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:10 --:--:-- 0\n\n000 192.168.24.1:9696 10.002 seconds\ncurl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received\n"
        },
3. server.log suggests the neutron process may have stopped processing entirely (the log file hasn't been updated in several hours):

2018-10-04 06:01:18.668 29251 DEBUG networking_odl.journal.journal [req-dfd667f1-e123-4691-81f2-1303f67b8232 - - - - -] Start processing journal entries sync_pending_entries /usr/lib/python2.7/site-packages/networking_odl/journal/journal.py:250
2018-10-04 06:01:18.682 1 INFO oslo_service.service [-] Child 29249 exited with status 0
2018-10-04 06:01:18.683 1 WARNING oslo_service.service [-] pid 29249 not in child list: OSError: [Errno 3] No such process
2018-10-04 06:01:18.695 1 INFO oslo_service.service [-] Child 29252 exited with status 0
2018-10-04 06:01:18.703 1 INFO oslo_service.service [-] Child 29253 exited with status 0
2018-10-04 06:01:18.729 29251 DEBUG networking_odl.journal.journal [req-dfd667f1-e123-4691-81f2-1303f67b8232 - - - - -] No journal entries to process sync_pending_entries /usr/lib/python2.7/site-packages/networking_odl/journal/journal.py:254
2018-10-04 06:01:18.748 29251 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-10-04 06:01:18.749 29251 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
2018-10-04 06:01:18.769 1 INFO oslo_service.service [-] Child 29251 exited with status 0

4. the fact this container doesn't recover from the 'unhealthy' status is very similar in behaviour to a nova_placement issue discovered and fixed recently: https://bugzilla.redhat.com/show_bug.cgi?id=1635665 (also reported as https://bugzilla.redhat.com/show_bug.cgi?id=1630069 <- which is fixed in newer versions of the nova RPM)

5. the unhealthy state happened only on 1 of 3 controllers in the HA cluster (the other 2 were healthy)

6. the agents get back to the :-) status after 'docker restart neutron_api'... although spawning VM instances still does not work: it still fails to allocate networks, with this error in nova-compute (see the vif-plugging notes at the end of this comment):

2018-10-04 10:33:19.554 1 ERROR nova.compute.manager [instance: 7c8f7701-09c0-4e73-8dc4-81b7b0668b45]     raise exception.VirtualInterfaceCreateException()
2018-10-04 10:33:19.554 1 ERROR nova.compute.manager [instance: 7c8f7701-09c0-4e73-8dc4-81b7b0668b45] VirtualInterfaceCreateException: Virtual Interface creation failed

raised in https://bugzilla.redhat.com/show_bug.cgi?id=1636034

Q1. why did the neutron_api container not recover?
Q2. why did the other 2 UP neutron_api servers not take over the load?
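Re item 2: the failing healthcheck is essentially a curl against the neutron API port with a ~10-second cap, so the container gets flagged unhealthy as soon as the API stops answering; FailingStreak 827 just means it has been failing for many hours. A hypothetical equivalent probe (endpoint and timeout taken from the check output above; this is not the actual kolla healthcheck script):

import sys

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

try:
    # mimic the healthcheck: does the neutron API answer within 10s?
    urlopen('http://192.168.24.1:9696/', timeout=10)
    print('healthy: neutron API answered')
except Exception as exc:
    # matches what we see above: curl times out after ~10s -> exit code 1
    print('unhealthy: %s' % exc)
    sys.exit(1)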
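Re the VirtualInterfaceCreateException in item 6: nova-compute plugs the VIF and then blocks waiting for a 'network-vif-plugged' external event from neutron; when that event never arrives (here, presumably because neutron/ODL never finishes wiring the port) and the timeout is fatal, the instance is failed with exactly this exception. For orientation, the relevant nova.conf options with their upstream defaults (raising the timeout would only mask the neutron-side problem, not fix it):

[DEFAULT]
# seconds nova-compute waits for neutron's network-vif-plugged event
vif_plugging_timeout = 300
# if true, a timeout raises VirtualInterfaceCreateException and the
# instance goes to ERROR instead of booting without networking
vif_plugging_is_fatal = true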
*** This bug has been marked as a duplicate of bug 1631335 ***