Description of problem: After restarting neutron-dhcp-agent, the agent shows as down in neutron agent-list while its namespaces are being re-managed. In a large environment this can take hours, so monitoring software such as Datadog raises false positives for the service being down even though it is up. Once all the namespaces have been re-managed, the agent shows as up. The agent should report a status of up while the systemd service is active and namespaces are being re-managed.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. restart neutron-dhcp-agent.
2. notice agent is down while re-managing namespaces.

Actual results:
the agent shows down while systemctl status shows active

Expected results:
the agent shows up while systemctl status shows active
The neutron-dhcp-agent must check in with neutron-server every X seconds, or else it gets marked down. When you run:
neutron agent-list | grep dhcp
| b63981ca-9775-4d4b-8617-f6486e766a5e | DHCP agent | server003 | :-) | True | neutron-dhcp-agent |
| f13a1c26-153a-44dd-bc57-5444cd1b8c34 | DHCP agent | server001 | :-) | True | neutron-dhcp-agent |
| f60c9cc8-c83d-4b0b-b819-36e9107a8131 | DHCP agent | server002 | xxx | True | neutron-dhcp-agent |
You'll see that, in this case, 'server002' is reporting down with an 'xxx' instead of a ':-)'.
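The up/down decision behind the ':-)' / 'xxx' markers is made server-side: neutron-server compares each agent's last reported heartbeat against a configured staleness window (the agent_down_time option, which defaults to 75 seconds). The sketch below illustrates that comparison; the function name and structure are illustrative, not the exact Neutron internals.

```python
# Illustrative sketch of the server-side liveness check (not the exact
# Neutron code): an agent is considered down ('xxx') when its last
# heartbeat is older than the agent_down_time window.
from datetime import datetime, timedelta
from typing import Optional

AGENT_DOWN_TIME = 75  # seconds; Neutron's agent_down_time default

def is_agent_down(last_heartbeat: datetime,
                  now: Optional[datetime] = None) -> bool:
    """Return True when the agent missed its check-in window."""
    now = now or datetime.utcnow()
    return now - last_heartbeat > timedelta(seconds=AGENT_DOWN_TIME)

now = datetime(2016, 5, 20, 19, 30, 0)
healthy = now - timedelta(seconds=30)    # reported 30s ago -> ':-)'
stale = now - timedelta(seconds=8449)    # blocked for hours -> 'xxx'
print(is_agent_down(healthy, now))  # False
print(is_agent_down(stale, now))    # True
```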
We've noticed that when neutron-dhcp-agent starts up, it tries to re-manage all of the existing namespaces, and this re-manage process preempts the health-reporting loop. As a result, the agent's 'manager' in neutron-server marks it as down until the agent has finished re-managing everything on that server.
You'll be able to see in the dhcp-agent.log messages like:
2016-05-20 19:27:58.251 506529 WARNING neutron.openstack.common.loopingcall [req-122e3f88-b837-4109-af8e-77cad0f2aeec ] task <bound method DhcpAgentWithStateReport._report_state of <neutron.agent.dhcp.agent.DhcpAgentWithStateReport object at 0x3945ed0>> run outlasted interval by 8449.14 sec
This shows that in this case the health loop was preempted for about 2.3 hours (8449 seconds).
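The warning above comes from a fixed-interval looping call: when one iteration's work blocks longer than the interval, the loop reports how far it overran. The snippet below is a minimal single-threaded sketch of that behavior, not the oslo/incubated loopingcall implementation; the helper names are made up for illustration.

```python
# Minimal sketch of a fixed-interval heartbeat loop: if the task blocks
# longer than the interval (e.g. a long namespace resync on the same
# cooperative thread), the loop logs a "run outlasted interval" warning
# instead of firing on time.  Names are illustrative, not the oslo API.
import time

def looping_call(task, interval, iterations):
    overruns = []
    for _ in range(iterations):
        start = time.monotonic()
        task()  # heartbeat work; a long resync here starves the loop
        elapsed = time.monotonic() - start
        if elapsed > interval:
            overruns.append(elapsed - interval)
            print("task run outlasted interval by %.2f sec"
                  % (elapsed - interval))
        else:
            time.sleep(interval - elapsed)
    return overruns

def slow_task():
    time.sleep(0.05)  # simulates work exceeding the 0.01s interval

overruns = looping_call(slow_task, interval=0.01, iterations=2)
```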
The code for the health check looks to be around here: https://github.com/openstack/neutron/blob/stable/kilo/neutron/agent/dhcp/agent.py#L570-L588
In our case it takes several hours for all the namespaces to be re-managed, and during that time the dhcp-agent is unavailable. It would seem to make more sense for the health monitor to preempt the rebuild effort, so the agent could show that it is available while it is starting up.
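The direction suggested above can be sketched with the state report running on its own thread, so a long initial sync cannot starve it. This is an illustration using standard threading under assumed names, not Neutron's actual eventlet-based implementation (and, as a later comment notes, reporting up during sync has architectural trade-offs).

```python
# Hedged sketch: run the periodic state report in a separate thread so
# that a long-running initial sync no longer blocks the heartbeat.
# All names here are hypothetical stand-ins.
import threading
import time

heartbeats = []
stop = threading.Event()

def report_state(interval=0.02):
    # In the real agent this would RPC a state report to neutron-server.
    while not stop.is_set():
        heartbeats.append(time.monotonic())
        stop.wait(interval)

def initial_sync():
    time.sleep(0.2)  # stands in for hours of namespace re-managing

reporter = threading.Thread(target=report_state, daemon=True)
reporter.start()
initial_sync()        # the long resync no longer blocks the heartbeat
stop.set()
reporter.join()
print("heartbeats sent during sync:", len(heartbeats))
```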
update package versions:
On control plane nodes (neutron-server runs here):
rpm -qa|grep neutron
On network nodes (dhcp-agent runs here):
rpm -qa|grep neutron
Daniel, can you please try to reproduce on master? I could not spot differences between the Neutron and Oslo.service loopingcall code on master, and the Neutron and incubated oslo service code in Neutron on Kilo. I don't see a reason at this time why this issue would not be applicable to master as well. Let's try to reproduce and root cause.
@Assaf: I have checked with Miguel and it might have something to do with the rootwrap daemon rather than with differences in the loopingcall implementations. This would explain why the issue doesn't show up in later releases, where the daemon is enabled by default.
The thing is that if rootwrap-daemon is not enabled, every command issued to re-manage the namespaces takes a few seconds, and in large environments that could be the root cause of the long delays. With the daemon, all commands go through it, reducing the times by a few orders of magnitude.
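The overhead difference described above can be illustrated in miniature: spawning a fresh interpreter per command pays fork/exec and startup cost every time, while a single long-lived worker pays it once and then services requests over a pipe. This sketch is a stand-in model, not the rootwrap code or its protocol.

```python
# Illustrative comparison (not rootwrap itself): per-command process
# spawning vs. one persistent worker handling all commands over stdin.
import subprocess
import sys
import time

COMMANDS = ["cmd-%d" % i for i in range(5)]

# Per-command model: a fresh interpreter per command, like invoking
# sudo + neutron-rootwrap once for every namespace operation.
start = time.monotonic()
for cmd in COMMANDS:
    subprocess.run([sys.executable, "-c", "print(%r)" % cmd],
                   stdout=subprocess.DEVNULL, check=True)
per_command = time.monotonic() - start

# Daemon model: one long-lived worker echoes every command back.
start = time.monotonic()
worker = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\nfor line in sys.stdin: print(line.strip())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
out, _ = worker.communicate("\n".join(COMMANDS) + "\n")
daemon = time.monotonic() - start

print("per-command: %.3fs, daemon: %.3fs" % (per_command, daemon))
```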
* Could you please check whether the rootwrap-daemon process is running? (Or maybe you have a sosreport where I can look myself?)
If enabled, you should be able to see an output like this:
[root@osp7 cloud-user(keystone_admin)]# ps fax | grep neutron-rootwrap-daemon
19478 ?  S   0:00  \_ sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
19479 ?  Sl  4:56  |   \_ /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
12657 ?  S   0:00  \_ sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
12658 ?  Sl  0:00      \_ /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
(In reply to Jeremy from comment #0)
> In our case, it takes several hours for all the namespaces to be re-managed
> and during that time the dhcp-agent will be unavailable. It would seem to
> make more sense that the health monitor could preempt the rebuild effort so
> the agent could show that it is available while it is starting up.
Because of Neutron's architecture, the agent is not able to process any messages until all of the setup (re-managing namespaces) is done. For that reason, it has to report its down state to neutron-server, so that the server is aware and can forward new networks to some other agent that is available.
As we may have a solution, I have reported the bug upstream to see what others think about it. Hopefully we can come up with a solution soon.
I have reviewed this patch and it will solve this bug, but it will also make the dhcp agent restart faster, since there was another bug that caused the initial sync to be done twice.
Once merged, I can backport it.
[root@controller-2 ~]# rpm -qa |grep openstack-neutron
Verified on OSP 7 with a director deployment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.