Description of problem:

Spawning VMs suddenly stops working in OSP14 + ODL. The VM goes into ERROR state with 'Failed to allocate the network(s) for instances' as the reason for the failure. This error is not seen in OSP13 jobs with networking-odl in the same setup.

In OSP14, the error below appears in neutron's server.log on controller-0 on every periodic task run:

2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task [req-fbd79a20-da08-4535-9ebd-18d033ae6667 - - - - -] Failed during periodic task operation journal_recovery.: InvalidRequestError: Object '<OpenDaylightJournal at 0x7f397e4b5cd0>' is already attached to session '133969' (this is '133978')
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task Traceback (most recent call last):
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/periodic_task.py", line 65, in _execute_op
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     operation(context)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 233, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return method(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 140, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     setattr(e, '_RETRY_EXCEEDED', True)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self.force_reraise()
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     six.reraise(self.type_, self.value, self.tb)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 136, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return f(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     ectxt.value = e.inner_exc
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self.force_reraise()
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     six.reraise(self.type_, self.value, self.tb)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return f(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 183, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     LOG.debug("Retry wrapper got retriable exception: %s", e)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self.force_reraise()
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     six.reraise(self.type_, self.value, self.tb)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 179, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return f(*dup_args, **dup_kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/recovery.py", line 53, in journal_recovery
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     _handle_non_existing_resource(context, row)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/recovery.py", line 108, in _handle_non_existing_resource
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     _sync_resource_to_odl(context, row, odl_const.ODL_CREATE, False)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/recovery.py", line 92, in _sync_resource_to_odl
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     journal.entry_complete(context, row)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 233, in wrapped
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return method(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 993, in wrapper
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     return fn(*args, **kwargs)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/journal/journal.py", line 136, in entry_complete
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     db.delete_row(context, entry)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib/python2.7/site-packages/networking_odl/db/db.py", line 120, in delete_row
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     context.session.delete(row)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1812, in delete
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     self._delete_impl(state, instance, head=True)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1824, in _delete_impl
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     to_attach = self._before_attach(state, obj)
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2179, in _before_attach
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task     state.session_id, self.hash_key))
2018-09-20 16:20:16.068 8814 ERROR networking_odl.journal.periodic_task InvalidRequestError: Object '<OpenDaylightJournal at 0x7f397e4b5cd0>' is already attached to session '133969' (this is '133978')

There are no error messages in neutron's server.log on controller-1.

Errors in neutron's server.log on controller-2 at the time of instance creation:

2018-09-20 18:15:02.105 8867 INFO neutron.pecan_wsgi.hooks.translation [req-9dbc787c-73fe-42a9-88e4-17c89dd85c3e 8275e38d11b849149c3639748bbc0eea 204b91f241f446e9a99f2c1d2d23d298 - default default] GET failed (client error): The resource could not be found.

There are constantly appearing unhandled exceptions in karaf.log; they show up around instance creation time as well as throughout idle time (this exception was already reported as https://bugzilla.redhat.com/show_bug.cgi?id=1626497), e.g.:

[root@controller-0 opendaylight]# grep -i "2018-09-20T17:15.*exception" karaf.log.1
2018-09-20T17:15:21,559 | WARN | nioEventLoopGroup-11-2 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:21,577 | WARN | nioEventLoopGroup-11-8 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:27,601 | WARN | nioEventLoopGroup-9-3 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:31,625 | WARN | nioEventLoopGroup-9-6 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:33,644 | WARN | nioEventLoopGroup-11-13 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.
2018-09-20T17:15:49,688 | WARN | nioEventLoopGroup-9-11 | OFFrameDecoder | 386 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.redhat-3 | Unexpected exception from downstream.

The network agents seem to be alive:

(overcloud) [stack@undercloud-0 ~]$ openstack network agent list
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| ID                                   | Agent Type     | Host                     | Availability Zone | Alive | State | Binary                       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| 045631e4-299b-4fab-9652-56fd86a99421 | ODL L2         | compute-1.localdomain    | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 100fbc1e-4a56-4c00-b5b6-52d5e72a00fd | Metadata agent | controller-1.localdomain | None              | :-)   | UP    | neutron-metadata-agent       |
| 1bad023f-9bad-4d1a-866d-4dd268452f81 | ODL L2         | compute-0.localdomain    | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 3440fd3d-274e-4368-96ea-7568912a20f1 | ODL L2         | controller-2.localdomain | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 37bec116-3af5-4bf1-8000-30461b544983 | ODL L2         | controller-1.localdomain | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 6554dee0-0435-48c0-9312-3854b0794f02 | ODL L2         | controller-0.localdomain | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 68283729-551b-432c-9707-b9b5c419ac07 | DHCP agent     | controller-0.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent           |
| 8f9c8889-7386-4a3a-9810-f690883e4d9c | Metadata agent | controller-0.localdomain | None              | :-)   | UP    | neutron-metadata-agent       |
| aeee37fe-e70d-457d-83f9-b6dae478c61f | DHCP agent     | controller-2.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent           |
| cf38b535-433b-4170-844a-288133a17f69 | DHCP agent     | controller-1.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent           |
| fd96f2d0-76c0-4236-95ec-c4a7c7dbde92 | Metadata agent | controller-2.localdomain | None              | :-)   | UP    | neutron-metadata-agent       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+

Version-Release number of selected component (if applicable):
osp14 (puddle:

How reproducible:
100%

Steps to Reproduce:
1. deploy OSP14 + ODL
2. create flavor, image, network, subnet, router
3. create an instance

Actual results:
The instance fails to spawn because network allocation fails.

Expected results:
The instance spawns and becomes ACTIVE.

Additional info:
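For reference, the InvalidRequestError in the traceback above is generic SQLAlchemy behaviour rather than anything ODL-specific: an ORM object attached to one Session cannot be deleted through another one. A minimal standalone sketch of that failure mode (the Journal model here is hypothetical, a stand-in for OpenDaylightJournal; it only mirrors the two-session pattern visible in recovery.py -> journal.entry_complete -> db.delete_row):

# minimal repro sketch, not networking-odl code (SQLAlchemy 1.4+ import paths)
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Journal(Base):  # hypothetical stand-in for OpenDaylightJournal
    __tablename__ = 'journal'
    id = Column(Integer, primary_key=True)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

s1 = Session(bind=engine)
row = Journal(id=1)
s1.add(row)
s1.commit()            # 'row' stays attached to s1 after the commit

s2 = Session(bind=engine)  # a second, concurrent session
s2.delete(row)             # raises sqlalchemy.exc.InvalidRequestError:
                           # Object '<Journal at 0x...>' is already
                           # attached to session '1' (this is '2')

i.e. the journal row handed to entry_complete() was fetched under a different session/transaction than the one that later tries to delete it.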
-2 for now - I need to check whether the CSIT suites that were running (and were restarting OpenDaylight) had something to do with this problem.
I was able to reproduce the issue on a fresh machine/deployment where CSIT did not run, only tempest. Tempest failed 3 scenario tests:
- test_server_connectivity_cold_migration[compute,id-a4858f6c-401e-4155-9a49-d5cd053d1a2f,network,slow]
- test_server_connectivity_cold_migration_revert[compute,id-25b188d7-0183-4b1e-a11d-15840c8e2fd6,network,slow]
- test_server_connectivity_resize[compute,id-719eb59d-2f42-4b66-b8b1-bb1254473967,network,slow]
(tempest results attached)

Also, I can't see the 'is already attached to session' periodic task error messages in the neutron server logs any more (checked all 3 controllers/neutrons).

I can see (only on 1 of 3 neutron servers) these "duplicate IpamAllocation" messages (harmless?? - a small sketch of the retry mechanism behind them is at the end of this comment):

[root@controller-2 neutron]# grep -Ei "error|trac|except" server.log | grep -i duplica
2018-09-21 09:46:43.685 10197 DEBUG neutron_lib.db.api [req-58ec2162-9f49-487c-911e-650e64c269cc - - - - -] Retry wrapper got retriable exception: Failed to create a duplicate IpamAllocation: for attribute(s) ['PRIMARY'] with value(s) 192.168.0.2-86d63683-7544-4d4a-a670-4c81715cc7a9 wrapped /usr/lib/python2.7/site-packages/neutron_lib/db/api.py:183
2018-09-21 09:46:45.011 10197 DEBUG neutron_lib.db.api [req-58ec2162-9f49-487c-911e-650e64c269cc - - - - -] Retry wrapper got retriable exception: Failed to create a duplicate IpamAllocation: for attribute(s) ['PRIMARY'] with value(s) 192.168.0.3-86d63683-7544-4d4a-a670-4c81715cc7a9 wrapped /usr/lib/python2.7/site-packages/neutron_lib/db/api.py:183

Details of the private openstack network on which the VM fails to spawn (fails to get the network allocated):

(overcloud) [stack@undercloud-0 ~]$ openstack network show workload_internal_net_0
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        | nova                                 |
| created_at                | 2018-09-21T08:46:33Z                 |
| description               |                                      |
| dns_domain                | None                                 |
| id                        | dd9e0f49-fa4f-4f64-a7e7-bc9de7984e5a |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | None                                 |
| is_vlan_transparent       | None                                 |
| mtu                       | 1450                                 |
| name                      | workload_internal_net_0              |
| port_security_enabled     | True                                 |
| project_id                | 9fe2db29bed14637a78690ac4c5a6027     |
| provider:network_type     | vxlan                                |
| provider:physical_network | None                                 |
| provider:segmentation_id  | 37                                   |
| qos_policy_id             | None                                 |
| revision_number           | 2                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | False                                |
| status                    | ACTIVE                               |
| subnets                   | fa96e871-62cd-4872-ad24-c75ffb41f1b1 |
| tags                      |                                      |
| updated_at                | 2018-09-21T08:46:40Z                 |
+---------------------------+--------------------------------------+
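On the 'duplicate IpamAllocation' lines: they are logged at DEBUG by the DB retry wrapper, which treats DBDuplicateEntry as a retriable error, so in isolation they should indeed be harmless - the allocation is simply retried. A self-contained toy of that mechanism built directly on oslo.db's wrap_db_retry (function name, attempt counter and addresses are made up; neutron actually goes through neutron_lib's retry decorator, which sits on top of the same machinery):

import itertools

from oslo_db import api as oslo_db_api
from oslo_db import exception as db_exc

_attempts = itertools.count()

# retry only when the failure is a duplicate-entry conflict
@oslo_db_api.wrap_db_retry(
    max_retries=3, retry_interval=0.1,
    exception_checker=lambda exc: isinstance(exc, db_exc.DBDuplicateEntry))
def allocate_address():
    if next(_attempts) == 0:
        # simulate losing the race for 192.168.0.2 to a concurrent request
        raise db_exc.DBDuplicateEntry(columns=['PRIMARY'])
    return '192.168.0.3'  # the retry picks the next free address

print(allocate_address())  # one simulated conflict, then success on retry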
Investigating a libvirt problem that is popping up on the compute nodes.
Please provide more info and the relevant logs
It seems that...

1. the network agents are dead:

(overcloud) [stack@undercloud-0 results]$ openstack network agent list
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| ID                                   | Agent Type     | Host                     | Availability Zone | Alive | State | Binary                       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| 16a69c16-0572-48b8-b00f-2db6bb42e26b | ODL L2         | controller-1.localdomain | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 352db67c-5d1f-43c0-aeb5-f844e81af981 | DHCP agent     | controller-0.localdomain | nova              | XXX   | UP    | neutron-dhcp-agent           |
| 498f5660-635a-49b1-8e82-8f64656977b8 | ODL L2         | controller-0.localdomain | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 4d6d6abd-99fc-4a5d-a9ac-edc1b430da70 | ODL L2         | compute-0.localdomain    | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 4e594765-53b0-45a3-85d0-368ba3dd2b52 | DHCP agent     | controller-2.localdomain | nova              | XXX   | UP    | neutron-dhcp-agent           |
| 54bac610-d2b5-4366-92cb-441876917ad0 | Metadata agent | controller-2.localdomain | None              | XXX   | UP    | neutron-metadata-agent       |
| 5b7e40b7-e97d-4a81-99c4-f1781c3ddbf8 | Metadata agent | controller-1.localdomain | None              | XXX   | UP    | neutron-metadata-agent       |
| 8664a7f7-fd09-4734-ab8c-6dde4a2696bc | ODL L2         | compute-1.localdomain    | None              | XXX   | UP    | neutron-odlagent-portbinding |
| 96b8e4df-2dc1-462a-8c2d-8a2046d9184b | ODL L2         | controller-2.localdomain | None              | XXX   | UP    | neutron-odlagent-portbinding |
| a53eb97a-d93e-46ab-9d73-f7fe1db2fe7c | Metadata agent | controller-0.localdomain | None              | XXX   | UP    | neutron-metadata-agent       |
| c585e594-6731-4d91-a9b5-2c0b425c6632 | DHCP agent     | controller-1.localdomain | nova              | XXX   | UP    | neutron-dhcp-agent           |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+

that's possibly because...

2. the neutron_api docker container is unhealthy (and doesn't recover; a probe equivalent to this healthcheck is sketched at the end of this comment):

a16c65e9b0a2 192.168.24.1:8787/rhosp14/openstack-neutron-server-opendaylight:2018-09-26.1 "kolla_start" 43 hours ago Up 35 hours (unhealthy) neutron_api

"Health": {
    "Status": "unhealthy",
    "FailingStreak": 827,
    "Log": [
        {
            "Start": "2018-10-04T05:13:43.109357324-04:00",
            "End": "2018-10-04T05:13:53.409602401-04:00",
            "ExitCode": 1,
            "Output": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:06 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:08 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:09 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:10 --:--:-- 0\n\n000 192.168.24.1:9696 10.002 seconds\ncurl: (28) Operation timed out after 10001 milliseconds with 0 out of -1 bytes received\n"
        },
3. server.log suggests the neutron process may have stopped processing entirely (the log file hasn't been updated in several hours):

2018-10-04 06:01:18.668 29251 DEBUG networking_odl.journal.journal [req-dfd667f1-e123-4691-81f2-1303f67b8232 - - - - -] Start processing journal entries sync_pending_entries /usr/lib/python2.7/site-packages/networking_odl/journal/journal.py:250
2018-10-04 06:01:18.682 1 INFO oslo_service.service [-] Child 29249 exited with status 0
2018-10-04 06:01:18.683 1 WARNING oslo_service.service [-] pid 29249 not in child list: OSError: [Errno 3] No such process
2018-10-04 06:01:18.695 1 INFO oslo_service.service [-] Child 29252 exited with status 0
2018-10-04 06:01:18.703 1 INFO oslo_service.service [-] Child 29253 exited with status 0
2018-10-04 06:01:18.729 29251 DEBUG networking_odl.journal.journal [req-dfd667f1-e123-4691-81f2-1303f67b8232 - - - - -] No journal entries to process sync_pending_entries /usr/lib/python2.7/site-packages/networking_odl/journal/journal.py:254
2018-10-04 06:01:18.748 29251 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-10-04 06:01:18.749 29251 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
2018-10-04 06:01:18.769 1 INFO oslo_service.service [-] Child 29251 exited with status 0

4. the fact this container doesn't recover from the 'unhealthy' status is very similar in behaviour to a nova_placement issue discovered and fixed recently: https://bugzilla.redhat.com/show_bug.cgi?id=1635665 (also reported as https://bugzilla.redhat.com/show_bug.cgi?id=1630069 <- which is fixed in newer versions of the nova RPM)

5. the unhealthy state happened only on 1 of 3 controllers in the HA cluster (the other 2 were healthy)

6. the agents get back to the :-) status after 'docker restart neutron_api'... although spawning VM instances still does not work: it still fails to allocate networks, with this error in nova-compute (see the vif-plugging notes at the end of this comment):

2018-10-04 10:33:19.554 1 ERROR nova.compute.manager [instance: 7c8f7701-09c0-4e73-8dc4-81b7b0668b45]     raise exception.VirtualInterfaceCreateException()
2018-10-04 10:33:19.554 1 ERROR nova.compute.manager [instance: 7c8f7701-09c0-4e73-8dc4-81b7b0668b45] VirtualInterfaceCreateException: Virtual Interface creation failed

raised in https://bugzilla.redhat.com/show_bug.cgi?id=1636034

Q1. why did the neutron_api container not recover?
Q2. why did the other 2 UP neutron_api servers not take over the load?
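Re item 2: the failing healthcheck is essentially a curl against the neutron API port with a ~10-second cap, so the container gets flagged unhealthy as soon as the API stops answering; FailingStreak 827 just means it has been failing for many hours. A hypothetical equivalent probe (endpoint and timeout taken from the check output above; this is not the actual kolla healthcheck script):

import sys

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

try:
    # mimic the healthcheck: does the neutron API answer within 10s?
    urlopen('http://192.168.24.1:9696/', timeout=10)
    print('healthy: neutron API answered')
except Exception as exc:
    # matches what we see above: curl times out after ~10s -> exit code 1
    print('unhealthy: %s' % exc)
    sys.exit(1)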
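Re the VirtualInterfaceCreateException in item 6: nova-compute plugs the VIF and then blocks waiting for a 'network-vif-plugged' external event from neutron; when that event never arrives (here, presumably because neutron/ODL never finishes wiring the port) and the timeout is fatal, the instance is failed with exactly this exception. For orientation, the relevant nova.conf options with their upstream defaults (raising the timeout would only mask the neutron-side problem, not fix it):

[DEFAULT]
# seconds nova-compute waits for neutron's network-vif-plugged event
vif_plugging_timeout = 300
# if true, a timeout raises VirtualInterfaceCreateException and the
# instance goes to ERROR instead of booting without networking
vif_plugging_is_fatal = true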
*** This bug has been marked as a duplicate of bug 1631335 ***