Description of problem:

A udev rule shipped in the images of the rhos-director deployment of OSP 7, /etc/udev/rules.d/99-dhcp-all-interfaces.rules, appears to cause a slow degradation of performance on the overcloud controllers. For every VM that is launched, a failed DHCP systemd resource appears to be created (visible in the output of "systemctl"). Over time the number of systemd resources on the controllers grows to several thousand, and performance degrades to the point where services start to fail and floating IP allocations stop working (or the time until ping and ssh to the floating IP succeed is so long that any orchestration method fails, be it rally, jenkins, or ansible).

When the environment is first rebooted, launching guests is snappy and launch-time tests give good results. The observed system load on the controllers when launching 20 simultaneous guests is only around 5 or 6. Over the course of 2-3 hours of testing, the failed systemctl resource count goes from a couple of hundred to several thousand, and the system load on the controllers jumps to 80-100 when the same 20 guests are launched.

We are currently removing the above udev rule and have done a couple of iterations of testing. Thus far I have not seen any degradation of controller performance over time, and no failed systemd services are being generated.

Kambiz

RHOS Director 7.1

Launch guests over time and look for the following on the controllers:

    systemctl | grep dhcp | grep fail | wc -l

If the number goes up, you are likely seeing this issue.
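For tracking the count over a test run, the check above can be wrapped in a small function. This is only a sketch: the function name count_failed_dhcp is made up for illustration, and the dhcp-interface@ unit name pattern is taken from the udev rule quoted in this bug, not from any packaged tool.

```shell
#!/bin/sh
# Sketch only: count failed dhcp-interface@ units in systemctl-style
# output read from stdin. The function name is hypothetical.
count_failed_dhcp() {
    # First grep keeps only dhcp-interface@<name>.service lines,
    # second grep counts those that mention "failed".
    # `|| true` keeps a count of 0 from being reported as an error.
    grep -E 'dhcp-interface@.*\.service' | grep -c 'failed' || true
}

# Assumed usage on a live controller:
#   systemctl list-units --all --no-legend --no-pager | count_failed_dhcp
```

Running it before and after a batch of guest launches and comparing the two numbers gives the same signal as the one-liner above, but limited to the units spawned by the udev rule.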
Here is the offending udev rule:

    SUBSYSTEM=="net", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="dhcp-interface@$name.service"
This udev rule is primarily used for provisioning, in order to bootstrap a working DHCP configuration on any real interface. This is required during initial provisioning.

----

The udev rules file here is meant to trigger a shell script which autodetects whether the interface MAC is generated (most VMs would have generated MAC addresses). If the device has a generated MAC, the script should exit quickly:

http://git.openstack.org/cgit/openstack/diskimage-builder/tree/elements/dhcp-all-interfaces/install.d/dhcp-all-interfaces.sh#n73

I'd be curious to know why that check isn't working. Or, if it was working, did we break it at some point?

---

Regardless, after initial provisioning is finished I see no harm in removing the udev rules file for dhcp-all-interfaces. The only downside would be the case where we lost all network connectivity and wanted to boot from DHCP again, for example by removing all /etc/sysconfig/network/ifcfg-* files and then rebooting.

A couple of options here would be:

1) Figure out why dhcp-all-interfaces.sh is starting DHCP on "virtual" interfaces (perhaps our /sys/class/net/${interface}/addr_assign_type is invalid?).

2) Use dhcp-all-interfaces.sh for initial provisioning and then remove it afterwards. This would potentially break the case where you want a network config reset that relies on a DHCP connection that is known to have worked once.
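To make the check under discussion concrete, here is a minimal sketch of an addr_assign_type test. It is not the code from dhcp-all-interfaces.sh; the function names and the SYSFS_NET override are hypothetical, added so the logic can be tried against a fake directory tree. The value meanings follow the kernel's sysfs documentation for this attribute.

```shell
#!/bin/sh
# Sketch only, not the actual dhcp-all-interfaces.sh logic.
# SYSFS_NET is a hypothetical override for testing against a fake tree.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Succeeds only if the interface reports a permanent MAC address.
has_permanent_mac() {
    iface="$1"
    type_file="$SYSFS_NET/$iface/addr_assign_type"
    # Interface already gone (or never existed): treat as not permanent.
    [ -r "$type_file" ] || return 1
    # 0 = permanent address, 1 = randomly generated,
    # 2 = stolen from another device, 3 = set by userspace
    [ "$(cat "$type_file")" = "0" ]
}

# Prints "start" if DHCP would be attempted on the interface, else "skip".
should_start_dhcp() {
    if has_permanent_mac "$1"; then
        echo "start"
    else
        echo "skip"
    fi
}
```

Running should_start_dhcp against the tap and namespace interfaces on an affected controller would show whether they report 0, which is the open question in option 1.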
I hate this bug so much. Dan, could you just fix it? Thanks.
(In reply to Hugh Brock from comment #5)
> I hate this bug so much. Dan, could you just fix it? Thanks.

Sure, I've got some ideas for evaluating physical vs. virtual interfaces.
A potential fix/workaround has been posted here:

https://review.openstack.org/272697

It would be nice to have more information from Kambiz (if possible) about why dhcp-all-interfaces.sh is leaking, specifically whether our /sys/class/net/${interface}/addr_assign_type check in this script isn't quite specific enough.
What we were seeing is an ever-increasing number of failed systemd resources over time as networks were created and deleted. The output of "systemctl" was an ever-growing list that included netns-related services, all of them failed (which may or may not be important), but I suspect this somehow impacted pacemaker as well.
*** Bug 1299083 has been marked as a duplicate of this bug. ***
I reproduced the bug on OSP 7.1. The failed resources are present on both compute and controller nodes (wherever interfaces are created and deleted). The failed resources appear when neutron deletes the port (the dhcp-interface services are never deleted). On our platform, once the number of failed resources goes over 10k, systemd starts to become unusable and the only thing left to do is a power cycle. To clean up an existing system with failed resources we used "systemctl reset-failed" (it worked as long as there were fewer than 10k failed resources). I just applied Dan's patch; no dhcp-interface services get created anymore, so there are no more failed services.
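For reference, the cleanup can be limited to the leaked units instead of resetting every failed unit on the host. This is a sketch under assumptions: it relies on systemctl reset-failed accepting a unit name pattern, which should be confirmed against the systemd version on the node, and the SYSTEMCTL variable and function name are hypothetical, there only to allow a dry run.

```shell
#!/bin/sh
# Sketch only. SYSTEMCTL is a hypothetical override so the command can be
# previewed (e.g. SYSTEMCTL=echo) before touching a live node.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Clear the failed state of the dhcp-interface@ units only.
# Assumes reset-failed accepts a glob pattern on this systemd version.
cleanup_failed_dhcp() {
    $SYSTEMCTL reset-failed 'dhcp-interface@*'
}
```

This only clears units that have already failed; new ones keep appearing until the udev rule is removed or the patch is applied.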
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
Verified with: openstack-tripleo-image-elements-0.9.9-2.el7ost.noarch

Code merged to: /usr/share/tripleo-image-elements/os-net-config/os-refresh-config/configure.d/20-os-net-config

Followed these steps:

Launch guests over time and look for the following on the controllers:

    systemctl | grep dhcp | grep fail | wc -l

If the number goes up, you will likely be seeing this issue.

Results:

    Every 2.0s: systemctl | grep dhcp | grep fail | wc -l    Tue Apr 19 15:52:27 2016

    2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0653.html