Bug 1339364
| | | | |
|---|---|---|---|
| Summary: | Restarting the dhcp-agent in a large environment causes the agent to report DOWN for an extended time while namespaces are being re-managed | | |
| Product: | Red Hat OpenStack | Reporter: | Jeremy <jmelvin> |
| Component: | openstack-neutron | Assignee: | Daniel Alvarez Sanchez <dalvarez> |
| Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.0 (Kilo) | CC: | amuller, ccollett, chrisw, dalvarez, jmelvin, nyechiel, oblaut, srevivo |
| Target Milestone: | --- | Keywords: | RFE, Triaged, Unconfirmed, ZStream |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-2015.1.4-13.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Previously, because of the way the initial sync was done after starting the DHCP agent, periodic state reports were not sent until the sync completed. In large environments this could be a lengthy process, and the lack of status reports during it caused the agent to be marked as DOWN. This fix allows periodic reports to be sent during the DHCP agent's initial sync, so the agent is no longer marked as DOWN. It also addresses a bug that caused the initial sync to be performed twice, halving the setup time. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Clones: | 1433255, 1433256 | | |
| Last Closed: | 2017-03-15 12:43:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1433255, 1433256 | | |
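The Doc Text above describes the fix at a high level. As a rough illustration only (plain `threading` with hypothetical names, not the actual Neutron/oslo code, which uses green threads and RPC), the idea is to start the periodic state-report loop before the expensive sync instead of after it:

```python
import threading
import time

REPORT_INTERVAL = 1  # illustrative; the real agent's report_interval defaults to 30s


def report_state(stop_event):
    # Heartbeat loop: keeps telling the server we are alive even though
    # the initial sync below is still grinding through namespaces.
    while not stop_event.wait(REPORT_INTERVAL):
        print("state report sent: agent alive")


def initial_sync():
    # Stand-in for re-managing every DHCP namespace; in a large
    # environment this is the part that used to take hours.
    for net in range(5):
        time.sleep(2)
        print("re-managed namespace for network %d" % net)


if __name__ == "__main__":
    stop = threading.Event()
    # Key point of the fix: reporting starts *before* the sync, so the
    # server never sees a silent gap long enough to mark the agent DOWN.
    threading.Thread(target=report_state, args=(stop,), daemon=True).start()
    initial_sync()  # and it runs exactly once, not twice
    stop.set()
    print("sync complete; agent fully operational")
```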
Description
Jeremy
2016-05-24 20:11:45 UTC
update package versions:

On control plane nodes (neutron-server runs here):

```
rpm -qa|grep neutron
python-neutron-lbaas-2015.1.2-1.el7ost.noarch
python-neutron-2015.1.2-13.el7ost.noarch
python-neutron-vpnaas-2015.1.2-1.el7ost.noarch
openstack-neutron-vpnaas-2015.1.2-1.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-13.el7ost.noarch
openstack-neutron-common-2015.1.2-13.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
openstack-neutron-2015.1.2-13.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-9.el7ost.noarch
```

On network nodes (dhcp-agent runs here):

```
rpm -qa|grep neutron
openstack-neutron-common-2015.1.2-13.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-11.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch
openstack-neutron-2015.1.2-13.el7ost.noarch
python-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-13.el7ost.noarch
python-neutron-2015.1.2-13.el7ost.noarch
```

Daniel, can you please try to reproduce on master? I could not spot differences between the oslo.service loopingcall code used on master and the incubated Oslo service code carried in Neutron on Kilo, so I see no reason at this time why this issue would not apply to master as well. Let's try to reproduce and root-cause it.

@Assaf: I have checked with Miguel, and it might have something to do with the rootwrap daemon rather than with differences in the loopingcall implementations. That would explain why the issue doesn't show up in later releases, where the daemon is enabled by default. If rootwrap-daemon is not enabled, every command issued to re-manage the namespaces takes a few seconds, and in large environments that could be the root cause of the big delays (please see [1]). With the daemon, all commands go through it, reducing the times by a few orders of magnitude.

@Jeremy: Could you please check whether the rootwrap-daemon process is running? (Or perhaps you have a sosreport I can look at myself?) If it is enabled, you should see output like this:

```
[root@osp7 cloud-user(keystone_admin)]# ps fax | grep neutron-rootwrap-daemon
19478 ?  S   0:00  \_ sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
19479 ?  Sl  4:56  |   \_ /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
12657 ?  S   0:00  \_ sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
12658 ?  Sl  0:00      \_ /usr/bin/python2 /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
```

(In reply to Jeremy from comment #0)
> In our case, it takes several hours for all the namespaces to be re-managed
> and during that time the dhcp-agent will be unavailable. It would seem to
> make more sense that the health monitor could preempt the rebuild effort so
> the agent could show that it is available while it is starting up.

Because of Neutron's architecture, the agent is not able to process any messages until all of its setup (re-managing the namespaces) is done. For that reason, it has to report its DOWN state to the Neutron server so that the server is aware of it and can forward new networks to some other agent that is available.

Thanks,
Daniel

[1] https://specs.openstack.org/openstack/neutron-specs/specs/kilo/rootwrap-daemon-mode.html#problem-description

As we may have a solution, I have reported the bug upstream [0] to see what others think. Hopefully we can come up with a solution soon.

[0] https://bugs.launchpad.net/neutron/+bug/1650611
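To make the rootwrap point above concrete: without the daemon, every privileged command pays full process-spawn overhead (sudo, a fresh Python interpreter, filter-file parsing), and that cost is multiplied by every namespace being re-managed. A hypothetical back-of-the-envelope sketch, not a real benchmark, where `true` stands in for the privileged command so it can run unprivileged:

```python
import subprocess
import time

NAMESPACES = 1000      # e.g., one DHCP namespace per managed network
COMMANDS_PER_NS = 5    # rough count of ip/dnsmasq/iptables calls per namespace

# Non-daemon mode: each call forks a new process, the way each
# "sudo neutron-rootwrap ..." invocation does (and the real thing also
# pays interpreter startup and filter parsing on every call).
start = time.time()
for _ in range(100):   # sample 100 spawns and extrapolate
    subprocess.call(["true"])
per_call = (time.time() - start) / 100

total = per_call * NAMESPACES * COMMANDS_PER_NS
print("~%.1f ms per spawn -> ~%.0f s spent just spawning processes"
      % (per_call * 1000, total))
# In daemon mode the fork/exec and filter parsing happen once; each
# command is then a socket round trip, orders of magnitude cheaper.
```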
I have reviewed this patch [0], and it will solve this bug. It will also make the DHCP agent restart faster, since there was another bug which caused the initial sync to be done twice. Once it merges, I can backport it.

[0] https://review.openstack.org/#/c/413010/

```
[root@controller-2 ~]# rpm -qa |grep openstack-neutron
openstack-neutron-ml2-2015.1.4-13.el7ost.noarch
openstack-neutron-common-2015.1.4-13.el7ost.noarch
openstack-neutron-openvswitch-2015.1.4-13.el7ost.noarch
openstack-neutron-lbaas-2015.1.4-1.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch
openstack-neutron-metering-agent-2015.1.4-13.el7ost.noarch
openstack-neutron-2015.1.4-13.el7ost.noarch
```

Verified on OSP 7 with a director deployment.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0528.html
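On the duplicate initial sync mentioned in the patch review above: the general bug class is two startup code paths both scheduling a full resync. A toy, hypothetical illustration of the pattern and of a guard that turns the second pass into a no-op (not the actual Neutron code):

```python
class DhcpAgentSketch(object):
    """Toy model of a startup double-resync and its fix."""

    def __init__(self):
        self._needs_resync = True

    def sync_state(self):
        if not self._needs_resync:
            # Guard in the spirit of the fix: the redundant second sync
            # skips re-managing every namespace all over again.
            print("already synced, skipping duplicate pass")
            return
        print("full sync: re-managing all namespaces (expensive)")
        self._needs_resync = False

    def run(self):
        # Bug pattern: both initialization and the run loop kicked off a
        # full sync, doubling an already hours-long startup.
        self.sync_state()   # scheduled during init
        self.sync_state()   # scheduled again when the agent starts


DhcpAgentSketch().run()
```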