Bug 1809634

Summary: [osp16] ERROR neutron.agent.dhcp.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr:
Product: Red Hat OpenStack Reporter: Chris Janiszewski <cjanisze>
Component: tripleo-ansible    Assignee: Rodolfo Alonso <ralonsoh>
Status: CLOSED CURRENTRELEASE QA Contact: Candido Campos <ccamposr>
Severity: urgent Docs Contact:
Priority: high    
Version: 16.1 (Train)    CC: bcafarel, bhaley, chrisw, jschluet, kholden, mweetman, oblaut, racedoro, ralonsoh, rdiwakar, scohen, stchen, vkoul
Target Milestone: ---    Keywords: Reopened, TestOnly, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tripleo-ansible-0.5.1-0.20200528013427.9da6a00.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-01 10:33:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1810475    

Description Chris Janiszewski 2020-03-03 15:19:10 UTC
Description of problem:
A few hours after a controller reboot, instance deployments started failing with:
Error: Failed to perform requested operation on instance "ocpra-8q78n-master-0", the instance has an error status: Please try again later [Error: Build of instance 2e77e7f8-69bc-477d-99e4-10ff94cf59ac aborted: Failed to allocate the network(s), not rescheduling.].


dhcp-agent.log shows a stream of:

2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 218, in sync_state
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     self.disable_dhcp_helper(deleted_id)
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     result = f(*args, **kwargs)
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 357, in disable_dhcp_helper
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     self.disable_isolated_metadata_proxy(network)
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     result = f(*args, **kwargs)
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 700, in disable_isolated_metadata_proxy
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     self._process_monitor, uuid, self.conf, network.namespace)
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/metadata/driver.py", line 250, in destroy_monitored_metadata_proxy
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     pm.disable()
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 113, in disable
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     utils.execute(cmd, run_as_root=self.run_as_root)
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 147, in execute
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent     returncode=returncode)
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: 
2020-03-03 14:00:47.892 6397 ERROR neutron.agent.dhcp.agent 



Version-Release number of selected component (if applicable):
OSP16

How reproducible:
Intermittent 

Steps to Reproduce:
1.
2.
3.

Actual results:
Error

Expected results:
Success

Additional info:
sosreport -> http://chrisj.cloud/sosreport-chrisj-osp16-controller-0-2020-03-03-hzreywy.tar.xz

Comment 1 Chris Janiszewski 2020-03-05 09:35:19 UTC
One thing that has happened in this environment is that the neutron DHCP port has disappeared, and maybe it's related to this bug.

(chrisj-osp16) [stack@undercloud-osp16 ~]$ openstack subnet show baremetal-sub
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field             | Value                                                                                                                                                            |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| allocation_pools  | 172.31.10.70-172.31.10.199                                                                                                                                       |
| cidr              | 172.31.10.0/24                                                                                                                                                   |
| created_at        | 2020-02-27T21:14:55Z                                                                                                                                             |
| description       |                                                                                                                                                                  |
| dns_nameservers   | 172.31.8.1                                                                                                                                                       |
| enable_dhcp       | True                                                                                                                                                             |
| gateway_ip        | 172.31.10.254                                                                                                                                                    |
| host_routes       |                                                                                                                                                                  |
| id                | e69757d3-e3c7-4634-8657-726c19883017                                                                                                                             |
| ip_version        | 4                                                                                                                                                                |
| ipv6_address_mode | None                                                                                                                                                             |
| ipv6_ra_mode      | None                                                                                                                                                             |
| location          | cloud='', project.domain_id=, project.domain_name='Default', project.id='f349b269b5bd470f83ffb18fdc1cc049', project.name='admin', region_name='regionOne', zone= |
| name              | baremetal-sub                                                                                                                                                    |
| network_id        | 45566c04-9a73-4736-acb5-abd040e63bed                                                                                                                             |
| prefix_length     | None                                                                                                                                                             |
| project_id        | f349b269b5bd470f83ffb18fdc1cc049                                                                                                                                 |
| revision_number   | 0                                                                                                                                                                |
| segment_id        | None                                                                                                                                                             |
| service_types     |                                                                                                                                                                  |
| subnetpool_id     | None                                                                                                                                                             |
| tags              |                                                                                                                                                                  |
| updated_at        | 2020-02-27T21:14:55Z                                                                                                                                             |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(chrisj-osp16) [stack@undercloud-osp16 ~]$ for i in `openstack port list | awk '/172.31.10/ {print $2}'`; do openstack port show $i | grep device_owner; done
| device_owner            | baremetal:none                                                                                                                                                   |
| device_owner            | compute:nova                                                                                                                                     |
| device_owner            |
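
For reference, a more direct check for the missing DHCP port is to filter the port list on device_owner (Neutron marks DHCP ports with device_owner network:dhcp); the network ID here is taken from the subnet output above, and an empty result confirms the DHCP port is gone:

$ openstack port list --network 45566c04-9a73-4736-acb5-abd040e63bed --device-owner network:dhcp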

Comment 2 Rodolfo Alonso 2020-03-23 11:49:06 UTC
Hello Chris:

I debugged this issue with Slawek and we found that, when the namespace is not found, the haproxy kill script fails and triggers the resync process again. The resync then loops repeatedly, always returning the same error:
"""
2020-03-03 13:59:42.147 6397 INFO neutron.agent.dhcp.agent [req-7555cc29-6a4a-4010-9211-25d45da0ea5c - - - - -] Synchronizing state complete
2020-03-03 13:59:42.147 6397 INFO neutron.agent.dhcp.agent [req-7555cc29-6a4a-4010-9211-25d45da0ea5c - - - - -] Synchronizing state
2020-03-03 13:59:42.438 6397 INFO neutron.agent.dhcp.agent [req-51828289-eee2-404c-9243-06be02eaca70 - - - - -] All active networks have been fetched through RPC.
2020-03-03 13:59:42.450 6397 ERROR neutron.agent.linux.utils [req-51828289-eee2-404c-9243-06be02eaca70 - - - - -] Exit code: 1; Stdin: ; Stdout: ; Stderr:
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent [req-51828289-eee2-404c-9243-06be02eaca70 - - - - -] Unable to sync network state on deleted network 1b2621fe-38c7-4572-9b5f-45b2f0687f94: neutron_lib.exceptions.ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr:
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 218, in sync_state
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     self.disable_dhcp_helper(deleted_id)
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     result = f(*args, **kwargs)
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 357, in disable_dhcp_helper
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     self.disable_isolated_metadata_proxy(network)
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     result = f(*args, **kwargs)
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 700, in disable_isolated_metadata_proxy
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     self._process_monitor, uuid, self.conf, network.namespace)
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/metadata/driver.py", line 250, in destroy_monitored_metadata_proxy
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     pm.disable()
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/external_process.py", line 113, in disable
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     utils.execute(cmd, run_as_root=self.run_as_root)
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py", line 147, in execute
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent     returncode=returncode)
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent neutron_lib.exceptions.ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr:
2020-03-03 13:59:42.451 6397 ERROR neutron.agent.dhcp.agent
"""


At the same time, this is the log of the kill script:
"""
Tue Mar  3 13:59:29 UTC 2020 Deleting container neutron-dnsmasq-qdhcp-3d1b1411-f8d2-4348-a500-e7af340f2d1c (6a2ac5c799fb7175b20ffa768b2b297f2c9de8e3f5271d99cf41ccff3dc94b64)
6a2ac5c799fb7175b20ffa768b2b297f2c9de8e3f5271d99cf41ccff3dc94b64
Tue Mar  3 13:59:30 UTC 2020 No network namespace detected, exiting
Tue Mar  3 13:59:30 UTC 2020 No network namespace detected, exiting
Tue Mar  3 13:59:31 UTC 2020 No network namespace detected, exiting
"""


I'm going to propose a patch for tripleo-ansible (tripleo-heat-templates in OSP16) to solve this issue in the haproxy kill script.
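
For illustration only, the kill-script hardening would be along these lines (a rough sketch with made-up variable names, not the actual patch): when the qdhcp namespace no longer exists there is nothing left to kill, so the wrapper should log and exit 0 instead of returning the non-zero code that makes the DHCP agent schedule yet another resync.
"""
# Sketch of a defensive check at the top of the haproxy kill wrapper.
# NETNS_NAME and HAPROXY_PID are illustrative placeholders, not the real script's variables.
if ! ip netns list | grep -q "^${NETNS_NAME}"; then
    echo "$(date) No network namespace detected, exiting"
    exit 0    # exit cleanly so the DHCP agent does not treat this as a failure and resync forever
fi
# Namespace still exists: terminate the metadata haproxy process inside it as before.
ip netns exec "${NETNS_NAME}" kill -15 "${HAPROXY_PID}"
"""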

Do you still have this deployment? Is it possible to debug this issue in your environment?

Regards and thank you in advance.

Comment 6 Chris Janiszewski 2020-03-23 15:22:48 UTC
Hi Rodolfo,

Thanks for looking into this and finding the underlying issue. Unfortunately I had to repurpose this environment for another workload, so it's currently not available for applying a patch.
I apologize for the inconvenience.

Comment 8 Rodolfo Alonso 2020-04-09 14:49:16 UTC
*** Bug 1810475 has been marked as a duplicate of this bug. ***

Comment 10 Jon Schlueter 2020-05-20 15:07:40 UTC
The patch is included in the latest released build of OSP 16.0.

Comment 11 Lon Hohberger 2020-05-22 10:43:46 UTC
According to our records, this should be resolved by tripleo-ansible-0.4.2-0.20200404124615.67005aa.el8ost.  This build is available now.

Comment 17 stchen 2020-11-04 20:11:28 UTC
Closing as EOL; OSP 16.0 has been retired as of Oct 27, 2020.

Comment 18 Brian Haley 2020-11-09 22:41:17 UTC
Re-opening targeted at 16.1