Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1696844

Summary: Instance stuck at creation phase - nova: timeout notifying of ports ready
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Linux
Status: CLOSED WORKSFORME
Severity: urgent
Priority: medium
Reporter: Cristian Muresanu <cmuresan>
Assignee: Brian Haley <bhaley>
QA Contact: Roee Agiman <ragiman>
CC: abeekhof, amuller, bhaley, chrisw, dhill, michele, mircea.vutcovici, mwitt, njohnston, plemenko, scohen, takirby
Keywords: ZStream
Flags: bhaley: needinfo-
Target Milestone: ---
Target Release: ---
Last Closed: 2019-06-20 11:44:05 UTC
Type: Bug
Attachments:
- SASL log from RabbitMQ with errors
- Full RabbitMQ log

Description Cristian Muresanu 2019-04-05 17:53:40 UTC
Description of problem:
Environment with 5 controllers, attempting to scale down to 3.
- Instance stuck at creation phase; nova reports "timeout notifying of ports ready"

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

2019-04-05 08:43:57.324 1 DEBUG nova.compute.manager [req-b68532ab-535f-4b25-a062-a5a9cc3463ff 1ec0de72145d4a738c5d3a11df42abde 59ab9dad8ae04bc180dc08b907bb7b1b - default default] [instance: 4b75518a-0c20-4642-b538-2e1beca5b733] Processing event network-vif-plugged-76bfd0be-e7fc-4794-a97b-c7f79aca2cad _process_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:7620


However, this event does not appear in your logs:

 grep 'Received event network-vif-plugged' nova-compute.log 
[nil]

~~~
Example of a failure.

Summary: nova-compute fails to launch the instance because it is waiting for neutron to set up the port; after a 300-second timeout it fails. According to the neutron logs the port is set up properly and marked as UP. The mystery is why this is failing.

Nova instance ID:  edc9363d-098b-4b93-b7ca-776576c7ccc7
Neutron port ID: 6eb2f9ad-626a-48f4-aa44-7ad0ee3f20e7

From nova-compute.log:

2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [req-8aa589cb-ee57-4cc6-b45b-7fcd18b1924c ae02f4e902974aea9f09234215e249f8 13e379f360e14a55993fa2bad3d1b7f6 - default default] [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7] Build of instance edc9363d-098b-4b93-b7ca-776576c7ccc7 aborted: Failed to allocate the network(s), not rescheduling.: BuildAbortException: Build of instance edc9363d-098b-4b93-b7ca-776576c7ccc7 aborted: Failed to allocate the network(s), not rescheduling.
2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7] Traceback (most recent call last):
2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1828, in _do_build_and_run_instance
2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7]     filter_properties, request_spec)
2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2079, in _build_and_run_instance
2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7]     reason=msg)
2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7] BuildAbortException: Build of instance edc9363d-098b-4b93-b7ca-776576c7ccc7 aborted: Failed to allocate the network(s), not rescheduling.
2019-04-04 17:39:08.432 1 ERROR nova.compute.manager [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7]
2019-04-04 17:39:08.432 1 DEBUG nova.compute.manager [req-8aa589cb-ee57-4cc6-b45b-7fcd18b1924c ae02f4e902974aea9f09234215e249f8 13e379f360e14a55993fa2bad3d1b7f6 - default default] [instance: edc9363d-098b-4b93-b7ca-776576c7ccc7] Deallocating network for instance _deallocate_network /usr/lib/python2.7/site-packages/nova/compute/manager.py:1662

Here we see that nova can get the vif details from neutron:


2019-04-04 17:39:06.596 1 DEBUG nova.network.os_vif_util [req-8aa589cb-ee57-4cc6-b45b-7fcd18b1924c ae02f4e902974aea9f09234215e249f8 13e379f360e14a55993fa2bad3d1b7f6 - default default] Converting VIF {"profile": {}, "ovs_interfaceid": "6eb2f9ad-626a-48f4-aa44-7ad0ee3f20e7", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "192.168.96.18"}], "version": 4, "meta": {"dhcp_server": "192.168.96.7"}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "138.120.252.54"}, {"meta": {}, "version": 4, "type": "dns", "address": "135.5.25.53"}], "routes": [], "cidr": "192.168.96.0/22", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.96.1"}}], "meta": {"injected": false, "tenant_id": "13e379f360e14a55993fa2bad3d1b7f6", "mtu": 8950}, "id": "91a32a42-0111-4b79-a6b3-7d0a347dbeea", "label": "kanata-regression_net"}, "devname": "tap6eb2f9ad-62", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:96:bd:4e", "active": false, "type": "ovs", "id": "6eb2f9ad-626a-48f4-aa44-7ad0ee3f20e7", "qbg_params": null} nova_to_osvif_vif /usr/lib/python2.7/site-packages/nova/network/os_vif_util.py:484



However, neutron shows that the port has been successfully set up:


From neutron server.log (server.log.1 attachment):

2019-04-04 17:34:06.492 39 DEBUG neutron.plugins.ml2.rpc [req-ee8aae59-33a9-417c-ba0c-791f51231215 - - - - -] Device 6eb2f9ad-626a-48f4-aa44-7ad0ee3f20e7 up at agent ovs-agent-nspcloud-compute-56.localdomain update_device_up /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:244

gin._port_provisioned-841481'] for port, provisioning_complete _notify_loop /usr/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py:167
2019-04-04 17:34:06.538 39 DEBUG neutron.db.provisioning_blocks [req-ee8aae59-33a9-417c-ba0c-791f51231215 - - - - -] Provisioning for port 6eb2f9ad-626a-48f4-aa44-7ad0ee3f20e7 completed by entity L2. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132


openvswitch-agent.log.1 shows the port binding as successful:

2019-04-04 17:34:06.491 934446 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-b00358e7-dfc1-4a95-bdb8-e73793a21f40 - - - - -] Setting status for 6eb2f9ad-626a-48f4-aa44-7ad0ee3f20e7 to UP _bind_devices /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:867
~~~

Comment 2 melanie witt 2019-04-05 18:43:38 UTC
In the nova-compute.log I see this:

2019-04-05 09:06:57.652 1 WARNING nova.virt.libvirt.driver [req-5b87720d-d4bb-4838-945f-c69b2bef4913 0cdec46214b43190a0fb8a6bd09391d6d10fb8a4ac31cc822111be8b97f15315 c469ddd4363f4b96b8206f8c8904d2da - eb837e5e103c48dfb8a9542518598494 eb837e5e103c48dfb8a9542518598494] [instance: 0e92d71f-8672-4cd0-a340-47ecf4d47362] Timeout waiting for [('network-vif-plugged', u'fd51efe8-36fa-47df-aef9-89820cfff60a')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds

where nova is waiting for the 'network-vif-plugged' event from neutron.

And in the neutron server.log.5 I see this:

2019-04-05 09:01:54.731 33 DEBUG neutron.notifiers.nova [-] Sending events: [{'tag': u'fd51efe8-36fa-47df-aef9-89820cfff60a', 'name': 'network-changed', 'server_uuid': u'0e92d71f-8672-4cd0-a340-47ecf4d47362'}] send_events /usr/lib/python2.7/site-packages/neutron/notifiers/nova.py:241

where neutron is sending the 'network-changed' event to nova, 5 minutes (300 seconds) before nova times out. But it does not send a 'network-vif-plugged' event (and thus nova times out waiting for it).

We'll need to consult with a neutron expert to learn why/how neutron is sending a 'network-changed' event instead of a 'network-vif-plugged' event.
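The event mismatch described above can be checked mechanically by grepping both sides for the port UUID. A minimal sketch, using stand-in copies of the two log lines quoted in this comment (real log paths and full log contents are assumptions):

```shell
#!/bin/sh
# Sketch: compare events neutron sent vs. events nova received for one port.
# PORT is the VIF/port UUID from the failure; the files below are stand-ins
# for the real neutron server.log and nova-compute.log.
PORT=fd51efe8-36fa-47df-aef9-89820cfff60a

cat > /tmp/neutron-server.log <<'EOF'
2019-04-05 09:01:54.731 33 DEBUG neutron.notifiers.nova [-] Sending events: [{'tag': u'fd51efe8-36fa-47df-aef9-89820cfff60a', 'name': 'network-changed', 'server_uuid': u'0e92d71f-8672-4cd0-a340-47ecf4d47362'}] send_events
EOF
cat > /tmp/nova-compute.log <<'EOF'
2019-04-05 09:06:57.652 1 WARNING nova.virt.libvirt.driver [instance: 0e92d71f-8672-4cd0-a340-47ecf4d47362] Timeout waiting for [('network-vif-plugged', u'fd51efe8-36fa-47df-aef9-89820cfff60a')]
EOF

echo "--- events neutron sent for $PORT:"
grep "Sending events" /tmp/neutron-server.log | grep "$PORT"

echo "--- did nova receive network-vif-plugged for $PORT?"
if grep -q "Received event network-vif-plugged.*$PORT" /tmp/nova-compute.log; then
  echo "vif-plugged event received"
else
  echo "no vif-plugged event received"
fi
```

On the sample lines this prints the 'network-changed' send and "no vif-plugged event received", which is exactly the mismatch nova times out on.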

Comment 4 Nate Johnston 2019-04-05 19:38:30 UTC
Timeline of a failure:

9:01:55 Nova plugs the VIF:

Nova Compute: 2019-04-05 09:01:55.778 1 INFO os_vif [req-5b87720d-d4bb-4838-945f-c69b2bef4913 0cdec46214b43190a0fb8a6bd09391d6d10fb8a4ac31cc822111be8b97f15315 c469ddd4363f4b96b8206f8c8904d2da - eb837e5e103c48dfb8a9542518598494 eb837e5e103c48dfb8a9542518598494] Successfully plugged vif VIFBridge(active=False,address=fa:16:3e:2c:97:3c,bridge_name='qbrfd51efe8-36',has_traffic_filtering=True,id=fd51efe8-36fa-47df-aef9-89820cfff60a,network=Network(24965cb1-4d5d-463e-ad17-37ce65198e82),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapfd51efe8-36')

Then the OVS agent starts working:

Neutron OVS Agent: 2019-04-05 09:01:55.889 62794 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6f1d7492-a27d-4cce-b9dc-10b23aa91805 - - - - -] Port fd51efe8-36fa-47df-aef9-89820cfff60a updated. Details: {'profile': {}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': '24965cb1-4d5d-463e-ad17-37ce65198e82', 'segmentation_id': 113, 'fixed_ips': [{'subnet_id': 'b037f603-4ca5-4a8b-b4ed-c38a54d34a65', 'ip_address': '192.168.96.10'}], 'device_owner': u'compute:testZone', 'physical_network': None, 'mac_address': 'fa:16:3e:2c:97:3c', 'device': u'fd51efe8-36fa-47df-aef9-89820cfff60a', 'port_security_enabled': True, 'port_id': 'fd51efe8-36fa-47df-aef9-89820cfff60a', 'network_type': u'vxlan', 'security_groups': [u'5b266fb3-a59e-4ea9-a923-c2afa057ebaa']}
Neutron OVS Agent: 2019-04-05 09:01:55.913 62794 INFO neutron.agent.securitygroups_rpc [req-6f1d7492-a27d-4cce-b9dc-10b23aa91805 - - - - -] Preparing filters for devices set([u'fd51efe8-36fa-47df-aef9-89820cfff60a'])
Neutron OVS Agent: 2019-04-05 09:01:56.254 62794 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6f1d7492-a27d-4cce-b9dc-10b23aa91805 - - - - -] Configuration for devices up [u'fd51efe8-36fa-47df-aef9-89820cfff60a'] and devices down [] completed.

Two seconds later, Nova pauses the VM?

2019-04-05 09:01:57.635 1 INFO nova.compute.manager [req-e6c146b2-f244-4476-9b00-c4101d32c343 - - - - -] [instance: 0e92d71f-8672-4cd0-a340-47ecf4d47362] VM Started (Lifecycle Event)
2019-04-05 09:01:57.681 1 INFO nova.compute.manager [req-e6c146b2-f244-4476-9b00-c4101d32c343 - - - - -] [instance: 0e92d71f-8672-4cd0-a340-47ecf4d47362] VM Paused (Lifecycle Event)
2019-04-05 09:01:57.764 1 INFO nova.compute.manager [req-e6c146b2-f244-4476-9b00-c4101d32c343 - - - - -] [instance: 0e92d71f-8672-4cd0-a340-47ecf4d47362] During sync_power_state the instance has a pending task (spawning). Skip.

Then, at 9:06:57, Nova gives up:

Nova Compute: 2019-04-05 09:06:57.652 1 WARNING nova.virt.libvirt.driver [req-5b87720d-d4bb-4838-945f-c69b2bef4913 0cdec46214b43190a0fb8a6bd09391d6d10fb8a4ac31cc822111be8b97f15315 c469ddd4363f4b96b8206f8c8904d2da - eb837e5e103c48dfb8a9542518598494 eb837e5e103c48dfb8a9542518598494] [instance: 0e92d71f-8672-4cd0-a340-47ecf4d47362] Timeout waiting for [('network-vif-plugged', u'fd51efe8-36fa-47df-aef9-89820cfff60a')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds
Nova Compute: 2019-04-05 09:06:58.011 1 INFO os_vif [req-5b87720d-d4bb-4838-945f-c69b2bef4913 0cdec46214b43190a0fb8a6bd09391d6d10fb8a4ac31cc822111be8b97f15315 c469ddd4363f4b96b8206f8c8904d2da - eb837e5e103c48dfb8a9542518598494 eb837e5e103c48dfb8a9542518598494] Successfully unplugged vif VIFBridge(active=False,address=fa:16:3e:2c:97:3c,bridge_name='qbrfd51efe8-36',has_traffic_filtering=True,id=fd51efe8-36fa-47df-aef9-89820cfff60a,network=Network(24965cb1-4d5d-463e-ad17-37ce65198e82),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapfd51efe8-36')
Nova Compute: 2019-04-05 09:06:58.412 1 INFO os_vif [req-5b87720d-d4bb-4838-945f-c69b2bef4913 0cdec46214b43190a0fb8a6bd09391d6d10fb8a4ac31cc822111be8b97f15315 c469ddd4363f4b96b8206f8c8904d2da - eb837e5e103c48dfb8a9542518598494 eb837e5e103c48dfb8a9542518598494] Successfully unplugged vif VIFBridge(active=False,address=fa:16:3e:2c:97:3c,bridge_name='qbrfd51efe8-36',has_traffic_filtering=True,id=fd51efe8-36fa-47df-aef9-89820cfff60a,network=Network(24965cb1-4d5d-463e-ad17-37ce65198e82),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapfd51efe8-36')

And then Neutron cleans up:

Neutron OVS Agent: 2019-04-05 09:06:58.666 62794 INFO neutron.agent.securitygroups_rpc [req-6f1d7492-a27d-4cce-b9dc-10b23aa91805 - - - - -] Remove device filter for set([u'fd51efe8-36fa-47df-aef9-89820cfff60a'])
Neutron OVS Agent: 2019-04-05 09:06:58.902 62794 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6f1d7492-a27d-4cce-b9dc-10b23aa91805 - - - - -] Ports set([u'fd51efe8-36fa-47df-aef9-89820cfff60a']) removed
Neutron OVS Agent: 2019-04-05 09:07:02.666 62794 INFO neutron.agent.common.ovs_lib [req-6f1d7492-a27d-4cce-b9dc-10b23aa91805 - - - - -] Port fd51efe8-36fa-47df-aef9-89820cfff60a not present in bridge br-int
Neutron OVS Agent: 2019-04-05 09:07:02.668 62794 INFO neutron.agent.securitygroups_rpc [req-6f1d7492-a27d-4cce-b9dc-10b23aa91805 - - - - -] Remove device filter for ['fd51efe8-36fa-47df-aef9-89820cfff60a']

Comment 8 Cristian Muresanu 2019-04-06 17:52:00 UTC
We are back at 5 controllers and get these errors:

2019-04-06 13:29:55.059 49443 ERROR neutron.agent.linux.utils [req-f02afe12-bff8-4acf-b4dc-d75bca6c33fe - - - - -] Exit code: 125; Stdin: ; Stdout: neutron-dnsmasq-qdhcp-858f11f7-f56c-4f9c-94fe-bc1ddba435b0
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent [req-f02afe12-bff8-4acf-b4dc-d75bca6c33fe - - - - -] Unable to reload_allocations dhcp for 858f11f7-f56c-4f9c-94fe-bc1ddba435b0.: ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: neutron-dnsmasq-qdhcp-858f11f7-f56c-4f9c-94fe-bc1ddba435b0
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 144, in call_driver
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 523, in reload_allocations
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=True)
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 465, in _spawn_or_reload_process
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 88, in enable
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: neutron-dnsmasq-qdhcp-858f11f7-f56c-4f9c-94fe-bc1ddba435b0
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent ; Stderr: Error response from daemon: removal of container neutron-dnsmasq-qdhcp-858f11f7-f56c-4f9c-94fe-bc1ddba435b0 is already in progress
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent /usr/bin/docker-current: Error response from daemon: Conflict. The container name "/neutron-dnsmasq-qdhcp-858f11f7-f56c-4f9c-94fe-bc1ddba435b0" is already in use by container e7564a5bed01c03bb54c776b71184944ac13a59a18bbb96dcb525a40e0a63b36. You have to remove (or rename) that container to be able to reuse that name..
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent See '/usr/bin/docker-current run --help'.
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent 
2019-04-06 13:29:55.060 49443 ERROR neutron.agent.dhcp.agent 
2019-04-06 13:29:55.321 49443 ERROR neutron.agent.linux.utils [req-5e824ddf-bcb7-4673-839f-64c6719a84bc - - - - -] Exit code: 125; Stdin: ; Stdout: neutron-dnsmasq-qdhcp-b3a9a970-93ba-4fb2-a65d-228825410339
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent [req-5e824ddf-bcb7-4673-839f-64c6719a84bc - - - - -] Unable to reload_allocations dhcp for b3a9a970-93ba-4fb2-a65d-228825410339.: ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: neutron-dnsmasq-qdhcp-b3a9a970-93ba-4fb2-a65d-228825410339
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 144, in call_driver
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 523, in reload_allocations
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=True)
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 465, in _spawn_or_reload_process
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 88, in enable
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 125; Stdin: ; Stdout: neutron-dnsmasq-qdhcp-b3a9a970-93ba-4fb2-a65d-228825410339
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent ; Stderr: Error response from daemon: removal of container neutron-dnsmasq-qdhcp-b3a9a970-93ba-4fb2-a65d-228825410339 is already in progress
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent /usr/bin/docker-current: Error response from daemon: Conflict. The container name "/neutron-dnsmasq-qdhcp-b3a9a970-93ba-4fb2-a65d-228825410339" is already in use by container 1de96ccda61471378b7d6787b92a0551fce3680a02539778d00c58178f8d755a. You have to remove (or rename) that container to be able to reuse that name..
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent See '/usr/bin/docker-current run --help'.
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent 
2019-04-06 13:29:55.322 49443 ERROR neutron.agent.dhcp.agent
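One possible manual workaround for the name conflict above (not from the bug itself) is to remove the stale container so the dhcp-agent can respawn dnsmasq under that name. A dry-run sketch that extracts the conflicting container name from the daemon error and only prints the cleanup command:

```shell
#!/bin/sh
# Sketch: recover from the "container name is already in use" conflict.
# The error text is embedded here for illustration; in practice it comes
# from the neutron dhcp-agent log. The `docker rm` is only echoed.
ERR='Error response from daemon: Conflict. The container name "/neutron-dnsmasq-qdhcp-858f11f7-f56c-4f9c-94fe-bc1ddba435b0" is already in use by container e7564a5bed01c03bb54c776b71184944ac13a59a18bbb96dcb525a40e0a63b36.'

# Pull the stale container name out of the error message.
NAME=$(echo "$ERR" | sed -n 's/.*container name "\/\([^"]*\)".*/\1/p')

# Removing the stale container lets the agent reuse the name.
echo "docker rm -f $NAME"
```

Note the log also says "removal of container ... is already in progress", so a forced removal may simply race with the in-flight one; checking `docker ps -a` first would be prudent.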

Comment 9 Peter Lemenkov 2019-04-06 20:00:40 UTC
Created attachment 1552983 [details]
SASL log from RabbitMQ with errors

Comment 10 Peter Lemenkov 2019-04-06 20:02:36 UTC
Created attachment 1552984 [details]
Full RabbitMQ log

Look for 22-Mar-2019::11:35:53

Comment 11 Peter Lemenkov 2019-04-06 21:04:28 UTC
Hello All!

I don't have a full understanding of the situation, but this is what can be found in the RabbitMQ logs.

From Controller-4 (edited - removed node state and messages' contents):

=============

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'neutron-vo-SubPort-1.0_fanout_eee7dfb8a3734af8b48cfab63eb22a46' in vhost '/': Slave <rabbit.18679.0> saw deaths of mirrors <rabbit.31803.144>

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'neutron-vo-SecurityGroup-1.0.nspcloud-compute-92.localdomain' in vhost '/': Slave <rabbit.24771.0> saw deaths of mirrors <rabbit.25718.182>

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'heat-engine-listener.96e41c89-3edb-4184-ab06-ff2d6ad77a5f' in vhost '/': Slave <rabbit.13848.0> saw deaths of mirrors <rabbit.22937.157>

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'q-agent-notifier-tunnel-update_fanout_9ec8642922464d7d9b058540561b377e' in vhost '/': Slave <rabbit.19878.0> saw deaths of mirrors <rabbit.28515.144>

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'heat-engine-listener.7c7fa8ca-1aee-4461-ad9c-a0ffded2883a' in vhost '/': Slave <rabbit.12947.0> saw deaths of mirrors <rabbit.1370.126>

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'reply_7bec38edc06f430f8eacdd3003d75501' in vhost '/': Slave <rabbit.30551.0> saw deaths of mirrors <rabbit.17400.266>

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'neutron-vo-SubPort-1.0.nspcloud-compute-63.localdomain' in vhost '/': Slave <rabbit.20608.0> saw deaths of mirrors <rabbit.19856.0>

=INFO REPORT==== 22-Mar-2019::11:35:59 ===
Mirrored queue 'compute_fanout_5ca17ffaf2e74d2d8da1fed8b9913ea2' in vhost '/': Slave <rabbit.6010.0> saw deaths of mirrors <rabbit.5194.0>

=ERROR REPORT==== 22-Mar-2019::11:35:59 ===
** Generic server <0.945.1> terminating
** Last message in was {'$gen_cast',
                        {'$gm',80,
                         {activity,
                          {77,<29740.194.1>},
                          [{{0,<29739.27133.2>},
                            []}]}}}
** When Server state == {state,
                         #Fun<rabbit_misc.execute_mnesia_transaction.1>,false}
** Reason for termination == 
** {{badmatch,false},
    [{gm,last_pub,2,[{file,"src/gm.erl"},{line,1614}]},
     {gm,'-calculate_activity/5-fun-0-',7,[{file,"src/gm.erl"},{line,1463}]},
     {gm,with_member_acc,3,[{file,"src/gm.erl"},{line,1376}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1262}]},
     {gm,handle_msg,2,[{file,"src/gm.erl"},{line,873}]},
     {gm,handle_cast,2,[{file,"src/gm.erl"},{line,646}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

=============

This is clearly GH#944. It means a partial cluster split ended up with multiple master nodes, so the cluster might not be able to resynchronize if nodes are simply switched one by one. We need to destroy the offending part of the cluster manually (switch it off, wipe its data, switch it on, rejoin the cluster).
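The manual recovery just described (switch off, wipe data, switch on, rejoin) maps onto standard rabbitmqctl subcommands. A dry-run sketch, run on each offending node; the commands are echoed rather than executed, and GOOD_NODE is an assumed hostname for a node on the healthy side:

```shell
#!/bin/sh
# Sketch of the manual recovery: take the offending node out, wipe its
# Mnesia state, and rejoin it to the surviving part of the cluster.
run() { echo "+ $*"; }          # replace the body with `"$@"` to execute
GOOD_NODE=rabbit@nspcloud-controller-0

run rabbitmqctl stop_app                     # stop the broker, keep the Erlang VM up
run rabbitmqctl reset                        # wipe this node's cluster state and data
run rabbitmqctl join_cluster "$GOOD_NODE"    # rejoin the healthy side
run rabbitmqctl start_app                    # resume serving; mirrors resynchronize
```

Repeating this node by node for the partitioned minority avoids the one-by-one restart that, per the comment, cannot resynchronize on its own.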


This explains why we see the following strange unhandled exceptions from Node-3:

===================

=INFO REPORT==== 22-Mar-2019::11:35:53 ===
Mirrored queue 'neutron-vo-Subnet-1.0_fanout_86a3490b4c8e49b2895afac60312780d' in vhost '/': Slave <rabbit.5422.0> saw deaths of mirrors <rabbit.21649.157>

=INFO REPORT==== 22-Mar-2019::11:35:53 ===
Mirrored queue 'neutron-vo-Trunk-1.1_fanout_798729be05604620adc9ecf2941a1bb2' in vhost '/': Slave <rabbit.1636.1> saw deaths of mirrors <rabbit.21382.157>

=INFO REPORT==== 22-Mar-2019::11:35:53 ===
Mirrored queue 'heat-engine-listener.5c81e9e2-6604-4bde-af12-fff80656aae7' in vhost '/': Slave <rabbit.18908.0> saw deaths of mirrors <rabbit.30326.125>

=INFO REPORT==== 22-Mar-2019::11:35:53 ===
Mirrored queue 'q-l3-plugin_fanout_bcdb3b80e65f44e68813aebe6be4afc2' in vhost '/': Slave <rabbit.31903.0> saw deaths of mirrors <rabbit.9407.0>

=INFO REPORT==== 22-Mar-2019::11:35:53 ===
Mirrored queue 'engine_worker.08405bb1-6fb1-4b65-9eab-8848fb7a964d' in vhost '/': Slave <rabbit.14086.0> saw deaths of mirrors <rabbit.4178.419>

=ERROR REPORT==== 22-Mar-2019::11:35:53 ===
** Generic server <0.24490.0> terminating
** Last message in was {'$gen_cast',
                           {'$gm',81,
                               {activity,
                                   {1,<29739.12087.113>},
                                   [{{0,<29737.537.172>},
                                     [],
                                     [314296,314297,314298]}]}}}
** When Server state == {state,
                            {78,<0.24490.0>},
                            {{1,<29739.12087.113>},#Ref<0.0.27.89579>},
                            {{62,<29740.25229.0>},#Ref<0.0.30.24219>},
                            {resource,<<"/">>,queue,
                                <<"neutron-vo-Port-1.1_fanout_2700dca41ac7416085378446b377e7e1">>},
                            rabbit_mirror_queue_slave,
                            {82,
                             [{{1,<29739.12087.113>},
                               {view_member,
                                   {1,<29739.12087.113>},
                                   [],
                                   {62,<29740.25229.0>},
                                   {78,<0.24490.0>}}},
                              {{62,<29740.25229.0>},
                               {view_member,
                                   {62,<29740.25229.0>},
                                   [{2,<29738.28860.137>}],
                                   {78,<0.24490.0>},
                                   {1,<29739.12087.113>}}},
                              {{78,<0.24490.0>},
                               {view_member,
                                   {78,<0.24490.0>},
                                   [],
                                   {1,<29739.12087.113>},
                                   {62,<29740.25229.0>}}}]},
                            1,
                            [{{1,<29739.12087.113>},{member,{[],[]},0,0}},
                             {{2,<29738.28860.137>},{member,{[],[]},0,0}},
                             {{62,<29740.25229.0>},{member,{[],[]},0,0}},
                             {{78,<0.24490.0>},
                              {member,
                                  {[{1,
                                     {ensure_monitoring,[<29737.3606.870>]}}],
                                   []},
                                  1,0}}],
                            [<0.24486.0>],
                            {[],[]},
                            [],0,undefined,#Ref<0.0.28.95928>,
                            #Fun<rabbit_misc.execute_mnesia_transaction.1>,
                            false}
** Reason for termination == 
** {badarg,[{queue,split,[3,{[],[]}],[{file,"queue.erl"},{line,321}]},
            {gm,apply_acks,2,[{file,"src/gm.erl"},{line,1601}]},
            {gm,'-calculate_activity/5-fun-0-',7,
                [{file,"src/gm.erl"},{line,1461}]},
            {gm,with_member_acc,3,[{file,"src/gm.erl"},{line,1376}]},
            {lists,foldl,3,[{file,"lists.erl"},{line,1262}]},
            {gm,handle_msg,2,[{file,"src/gm.erl"},{line,873}]},
            {gm,handle_cast,2,[{file,"src/gm.erl"},{line,646}]},
            {gen_server2,handle_msg,2,
                         [{file,"src/gen_server2.erl"},{line,1047}]}]}

=================== 

This node cannot sync the state (apply acks from the client).

Node-0 tells us explicitly that the cluster was split into parts (edited - removed node's state):

===================
=ERROR REPORT==== 22-Mar-2019::11:35:54 ===
Mnesia('rabbit@nspcloud-controller-0'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@nspcloud-controller-2'}

=ERROR REPORT==== 22-Mar-2019::11:35:54 ===
** Generic server <0.22971.741> terminating
** Last message in was {'$gen_cast',
                           {method,
                               {'queue.declare',0,<<"engine_worker">>,false,
                                   false,false,false,false,[]},
                               none,noflow}}
** Reason for termination ==
** {{badmatch,true},
    [{rabbit_amqqueue,'-with/4-fun-0-',5,
                      [{file,"src/rabbit_amqqueue.erl"},{line,468}]},
     {rabbit_channel,handle_method,3,
                     [{file,"src/rabbit_channel.erl"},{line,1323}]},
     {rabbit_channel,handle_cast,2,
                     [{file,"src/rabbit_channel.erl"},{line,470}]},
     {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

===================


^^^ By the way, this is a known problem, but it has low severity in this case. The key thing here is "inconsistent_database, running_partitioned_network".
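The partition state flagged by that Mnesia error can also be confirmed directly from any node; `rabbitmqctl cluster_status` reports a partitions section that is non-empty when nodes have split. Echo-only dry run (output shape varies by RabbitMQ version):

```shell
#!/bin/sh
# Sketch: confirm a partitioned cluster from the CLI. Dry run: the command
# is echoed, not executed.
run() { echo "+ $*"; }
run rabbitmqctl cluster_status   # look for a non-empty {partitions,...} entry
```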


I see a messages about net_tick_timeout as well which means there is something going on on the underlying level which RabbitMQ just cannot handle. These network glitches could either be the root cause of the issue or simply increase its severity.


These are just my napkin notes; I still can't say much about the root cause.

Comment 13 David Hill 2019-04-07 20:46:33 UTC
It looks like we have to change the rpc_response_timeout value and the number of api/rpc workers in order to get the dhcp-agent/ovs-agent to start properly.
They have 9000 ports, of which ~100 are down and ~1500 are in N/A state. This looks like one of the issues we had with another customer (Desjardins), who had a bunch of down/unbound ports causing delays, plus one controller getting more load than the others. Exactly the same situation, except there we also had a problem with the disks on one of the controllers, if I remember correctly. So now they have a pretty stable/fast environment, but we needed the changes below:

parameter_defaults:
  ControllerExtraConfig:
    neutron::rpc_response_timeout: 600
    neutron::server::api_workers: 28
    neutron::server::rpc_workers: 28

which I did manually with this:

sed -i 's/api_workers=12/api_workers=28/g' /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf
sed -i 's/rpc_workers=12/rpc_workers=28/g' /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf
sed -i 's/#rpc_response_timeout = 60/rpc_response_timeout = 600/g' /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf
docker restart neutron_dhcp
docker restart neutron_api
docker restart neutron_ovs_agent


If we change the timeout value to something lower than 600 seconds, we get CPU spikes on controller-3 every time the timeout is reached for each neutron call. I've seen that heat::rpc_response_time is set to 600 by default... shouldn't we set the same value (or a similar one) in the tht?
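The sed edits above can be rehearsed safely on a throwaway copy of neutron.conf before touching /var/lib/config-data and restarting containers. A sketch; the stock values (api_workers=12, a commented-out rpc_response_timeout) are assumptions matching the patterns in the sed commands:

```shell
#!/bin/sh
# Sketch: rehearse the neutron.conf edits on a throwaway copy and verify
# the resulting values before applying them to the real file.
CONF=/tmp/neutron.conf
cat > "$CONF" <<'EOF'
[DEFAULT]
#rpc_response_timeout = 60
api_workers=12
rpc_workers=12
EOF

# The same substitutions used against the puppet-generated config:
sed -i 's/api_workers=12/api_workers=28/g' "$CONF"
sed -i 's/rpc_workers=12/rpc_workers=28/g' "$CONF"
sed -i 's/#rpc_response_timeout = 60/rpc_response_timeout = 600/g' "$CONF"

# Show the effective values after the edits.
grep -E 'workers|rpc_response_timeout' "$CONF"
```

Because the sed patterns are exact-match, they silently do nothing if the stock values differ (e.g. workers already tuned), so the grep check at the end matters.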

Comment 14 Peter Lemenkov 2019-04-07 22:42:41 UTC
Looking into the RabbitMQ logs, I don't see any new issues beyond those mentioned in comment 11.

Comment 17 Nate Johnston 2019-04-23 11:43:15 UTC
It looks like no SOS reports have been forthcoming as the system has remained stable with a higher timeout; the 100% CPU symptom has not returned.