Bug 1339364

Summary: Restarting dhcp-agent in a large environment causes the agent to report down for an extended time while namespaces are re-managed
Product: Red Hat OpenStack
Reporter: Jeremy <jmelvin>
Component: openstack-neutron
Assignee: Daniel Alvarez Sanchez <dalvarez>
Status: CLOSED ERRATA
QA Contact: Eran Kuris <ekuris>
Severity: medium
Priority: medium
Version: 7.0 (Kilo)
CC: amuller, ccollett, chrisw, dalvarez, jmelvin, nyechiel, oblaut, srevivo
Target Milestone: ---
Keywords: RFE, Triaged, Unconfirmed, ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-neutron-2015.1.4-13.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, because of the way the initial sync was done after starting the DHCP agent, periodic reports were not sent until the sync completed. In large environments this could be a lengthy process, and the lack of status reports during it caused the agent to be marked as DOWN. This fix allows periodic reports to be sent during the DHCP agent's initial sync, so it is no longer marked as DOWN. It also addresses a bug that caused the initial sync to be performed twice, so the setup time has been halved.
Story Points: ---
Clones: 1433255, 1433256
Last Closed: 2017-03-15 12:43:49 UTC
Type: Bug
Bug Blocks: 1433255, 1433256

Description Jeremy 2016-05-24 20:11:45 UTC
Description of problem: After restarting neutron-dhcp-agent, the agent shows as down in 'neutron agent-list' while the namespaces are re-managed. In a large environment this can take hours, so alert monitoring software such as Datadog gives false positives for the service being down even though it's up. Once all the namespaces are re-managed, the agent shows as up. It seems the agent should show a status of up while the systemd service is active and namespaces are being re-managed.


Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.0-16.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Restart neutron-dhcp-agent.
2. Notice that the agent is reported as down while it re-manages its namespaces.

Actual results:
The agent shows as down in 'neutron agent-list' while 'systemctl status' reports the service as active.

Expected results:
The agent shows as up while 'systemctl status' reports the service as active.


Additional info:


The neutron-dhcp-agent must check in with neutron-server at a fixed interval, or else it gets marked down. When you run:

neutron agent-list | grep dhcp
| b63981ca-9775-4d4b-8617-f6486e766a5e | DHCP agent         | server003 | :-)   | True           | neutron-dhcp-agent        |
| f13a1c26-153a-44dd-bc57-5444cd1b8c34 | DHCP agent         | server001 | :-)   | True           | neutron-dhcp-agent        |
| f60c9cc8-c83d-4b0b-b819-36e9107a8131 | DHCP agent         | server002 | xxx   | True           | neutron-dhcp-agent        |

You'll see that, in this case, 'server002' is reporting down with an 'xxx' instead of a ':-)'.

We've noticed that when a neutron-dhcp-agent starts up, it tries to re-manage all the existing namespaces, and this re-managing process preempts the health-reporting loop. This causes the agent to be marked as down by its manager in neutron-server until it has finished re-managing everything on that server.

You'll see messages like this in dhcp-agent.log:
2016-05-20 19:27:58.251 506529 WARNING neutron.openstack.common.loopingcall [req-122e3f88-b837-4109-af8e-77cad0f2aeec ] task <bound method DhcpAgentWithStateReport._report_state of <neutron.agent.dhcp.agent.DhcpAgentWithStateReport object at 0x3945ed0>> run outlasted interval by 8449.14 sec

This shows that, in this case, the health loop was preempted for 8449 seconds, about 2.3 hours.

The code for the health check looks to be around here: https://github.com/openstack/neutron/blob/stable/kilo/neutron/agent/dhcp/agent.py#L570-L588

In our case, it takes several hours for all the namespaces to be re-managed, and during that time the dhcp-agent is reported as unavailable. It would make more sense for the health monitor to preempt the rebuild effort, so the agent could show as available while it is starting up.
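
For context, the timing behind "marked down" is configurable. The snippet below is a sketch assuming the standard Kilo-era option names (report_interval on the agent side, agent_down_time on the server side); the values are illustrative, and the exact section names should be verified against the deployed neutron.conf:

# /etc/neutron/neutron.conf on the network node (agent side)
[agent]
# Seconds between state reports sent to neutron-server.
report_interval = 30

# /etc/neutron/neutron.conf on the controller (server side)
[DEFAULT]
# Seconds without a state report before the server marks the agent down;
# typically at least twice report_interval.
agent_down_time = 75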

Comment 1 Jeremy 2016-05-24 20:34:25 UTC
Updated package versions:

On control plane nodes (neutron-server runs here):
rpm -qa|grep neutron
python-neutron-lbaas-2015.1.2-1.el7ost.noarch
python-neutron-2015.1.2-13.el7ost.noarch
python-neutron-vpnaas-2015.1.2-1.el7ost.noarch
openstack-neutron-vpnaas-2015.1.2-1.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-13.el7ost.noarch
openstack-neutron-common-2015.1.2-13.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
openstack-neutron-2015.1.2-13.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-9.el7ost.noarch


On network nodes (dhcp-agent runs here):
rpm -qa|grep neutron
openstack-neutron-common-2015.1.2-13.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-11.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
openstack-neutron-2015.1.2-13.el7ost.noarch
python-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-13.el7ost.noarch
python-neutron-2015.1.2-13.el7ost.noarch

Comment 6 Assaf Muller 2016-10-26 13:30:44 UTC
Daniel, can you please try to reproduce on master? I could not spot differences between the oslo.service loopingcall code used by Neutron on master and the incubated oslo service code carried in Neutron on Kilo, so I see no reason at this time why this issue would not apply to master as well. Let's try to reproduce it and find the root cause.

Comment 7 Daniel Alvarez Sanchez 2016-11-14 16:42:14 UTC
@Assaf: I have checked with Miguel, and it might have something to do with the rootwrap daemon rather than with differences in the loopingcall implementations. This would explain why the issue doesn't show up in later releases, where the daemon is enabled by default.

The thing is that if the rootwrap daemon is not enabled, every command issued to re-manage the namespaces takes a few seconds, and in a large environment that adds up to long delays (see [1]). With the daemon, all commands go through a single long-lived process, reducing execution times by a few orders of magnitude.

@jeremy:

* Could you please check whether the neutron-rootwrap-daemon process is running? (Or maybe you have a sosreport where I can look myself?)

If enabled, you should see output like this:

[root@osp7 cloud-user(keystone_admin)]# ps fax | grep neutron-rootwrap-daemon
19478 ?        S      0:00  \_ sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
19479 ?        Sl     4:56  |   \_ /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
12657 ?        S      0:00  \_ sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
12658 ?        Sl     0:00      \_ /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
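
If the daemon is not running, it can be enabled through the agent's root helper settings in neutron.conf. A sketch, assuming the option names introduced by the Kilo rootwrap-daemon spec referenced in [1] below (verify against your installed version):

[agent]
# Per-command root wrapper (one sudo + rootwrap invocation per command)
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
# Long-lived daemon that executes commands without per-call startup cost
root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf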


(In reply to Jeremy from comment #0)
> In our case,  it takes several hours for all the namespaces to be re-managed
> and during that time the dhcp-agent will be unavailable.    It would seem to
> make more sense that the health monitor could preempt the rebuild effort so
> the agent could show that  it is available while it is starting up.

Because of the neutron architecture, the agent cannot process any messages until all the setup (re-managing namespaces) is done. For that reason, it has to report its state as down to the neutron server, so the server is aware and can schedule new networks to some other agent that is available.

Thanks,
Daniel

[1] https://specs.openstack.org/openstack/neutron-specs/specs/kilo/rootwrap-daemon-mode.html#problem-description

Comment 17 Daniel Alvarez Sanchez 2016-12-16 16:20:18 UTC
As we may have a solution, I have reported the bug upstream [0] to see what others think about it. Hopefully we can come up with a solution soon.

[0] https://bugs.launchpad.net/neutron/+bug/1650611

Comment 18 Daniel Alvarez Sanchez 2017-01-05 10:42:39 UTC
I have reviewed this patch [0]; it will solve this bug and also make the DHCP agent restart faster, since there was another bug that caused the initial sync to be done twice.

Once merged, I can backport it.

[0] https://review.openstack.org/#/c/413010/
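
For illustration of the general pattern only (a hedged sketch, not the actual patch in [0]; all names below are hypothetical): the fix amounts to letting the periodic state report run independently of the long initial sync, so the sync can no longer starve it. Neutron itself uses eventlet-based looping calls; plain threads are used here just to show the decoupling:

import threading
import time

REPORT_INTERVAL = 30  # seconds; stands in for the agent's report_interval

def report_state():
    # Stand-in for the RPC call that tells neutron-server "I'm alive".
    print("state report sent")

def initial_sync():
    # Stand-in for re-managing all namespaces; may take hours at scale.
    time.sleep(120)

def run_agent():
    # Start the heartbeat in its own daemon thread *before* the sync,
    # so the long-running sync below cannot delay state reports.
    def heartbeat():
        while True:
            report_state()
            time.sleep(REPORT_INTERVAL)

    threading.Thread(target=heartbeat, daemon=True).start()
    initial_sync()
    # ...normal event processing would continue here...

if __name__ == "__main__":
    run_agent()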

Comment 23 Eran Kuris 2017-03-09 12:38:32 UTC
[root@controller-2 ~]# rpm -qa |grep openstack-neutron
openstack-neutron-ml2-2015.1.4-13.el7ost.noarch
openstack-neutron-common-2015.1.4-13.el7ost.noarch
openstack-neutron-openvswitch-2015.1.4-13.el7ost.noarch
openstack-neutron-lbaas-2015.1.4-1.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch
openstack-neutron-metering-agent-2015.1.4-13.el7ost.noarch
openstack-neutron-2015.1.4-13.el7ost.noarch

Verified on OSP 7 with a director deployment.

Comment 25 errata-xmlrpc 2017-03-15 12:43:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0528.html