| Summary: | /etc/udev/rules.d/99-dhcp-all-interfaces.rules causes a slow and miserable degradation until things fail | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Kambiz Aghaiepour <kambiz> | |
| Component: | openstack-tripleo-image-elements | Assignee: | Dan Prince <dprince> | |
| Status: | CLOSED ERRATA | QA Contact: | Omri Hochman <ohochman> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 7.0 (Kilo) | CC: | athomas, clecomte, dprince, dsneddon, ggillies, hbrock, jcoufal, jtaleric, mburns, mcornea, nbarcet, ohochman, pbarta, rhel-osp-director-maint, swaite, vcojot | |
| Target Milestone: | async | Keywords: | Triaged | |
| Target Release: | 8.0 (Liberty) | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | openstack-tripleo-image-elements-0.9.9-2.el7ost | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1299083 1383678 | Environment: | |
| Last Closed: | 2016-04-20 13:04:05 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1299083, 1383678 | |||
|
Description
Kambiz Aghaiepour
2015-12-22 19:24:00 UTC
Here is the offending udev rule:
SUBSYSTEM=="net", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="dhcp-interface@$name.service"
This udev rule is primarily used during provisioning to bootstrap a working DHCP configuration on any real interface. This is required during initial provisioning.

----

The udev rules file is meant to trigger a shell script that autodetects whether the interface MAC is generated (most VMs have generated MAC addresses). If the device has a generated MAC, the script should exit quickly:

http://git.openstack.org/cgit/openstack/diskimage-builder/tree/elements/dhcp-all-interfaces/install.d/dhcp-all-interfaces.sh#n73

I'd be curious to know why that check isn't working. Or, if it was working, did we break it at some point?

---

Regardless, after initial provisioning is finished I see no harm in removing the udev rules file for dhcp-all-interfaces. The only downside would be the case where we lost all network connectivity and wanted to boot from DHCP again by removing all /etc/sysconfig/network/ifcfg-* files and then rebooting or something.

A couple of options here would be:

1) Figure out why dhcp-all-interfaces.sh is starting DHCP on "virtual" interfaces (perhaps our /sys/class/net/${interface}/addr_assign_type check is invalid?).

2) Use dhcp-all-interfaces.sh for initial provisioning and then remove it afterwards. This would potentially break the case where you want a network config reset that relies on a DHCP connection known to have worked once.

I hate this bug so much. Dan, could you just fix it? Thanks.

(In reply to Hugh Brock from comment #5)
> I hate this bug so much. Dan, could you just fix it? Thanks.

Sure, I've got some ideas for evaluating physical vs. virtual interfaces.

A potential fix/workaround has been posted here: https://review.openstack.org/272697

It would be nice to have more information from Kambiz (if possible) about why dhcp-all-interfaces.sh is leaking, specifically whether our /sys/class/net/${interface}/addr_assign_type check in this script isn't quite specific enough.
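For readers unfamiliar with the addr_assign_type check discussed above, here is a minimal sketch of how such a check could look. It is not the code from dhcp-all-interfaces.sh or from the posted review. The function names and the SYSFS_NET override are hypothetical, introduced only for illustration. The value meanings (0 = permanent, 1 = randomly generated, 2 = stolen from another device, 3 = set by userspace) come from the kernel's sysfs documentation for network devices.

```shell
#!/bin/sh
# Sketch only: classify an interface by how its MAC address was assigned.
# SYSFS_NET is a hypothetical override so the functions can be exercised
# against a fake directory tree; on a real host it is /sys/class/net.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Prints "permanent", "generated", "stolen", "set", or "unknown".
mac_assign_type() {
    file="$SYSFS_NET/$1/addr_assign_type"
    [ -r "$file" ] || { echo unknown; return 1; }
    case "$(cat "$file")" in
        0) echo permanent ;;   # burned-in hardware address
        1) echo generated ;;   # randomly generated by the kernel
        2) echo stolen ;;      # copied from another device
        3) echo set ;;         # assigned explicitly from userspace
        *) echo unknown ;;
    esac
}

# Succeeds only for interfaces with a permanent hardware address, which
# is the stricter test the thread speculates might be needed.
wants_dhcp() {
    [ "$(mac_assign_type "$1")" = "permanent" ]
}
```

With these definitions, something like `wants_dhcp eth0 && echo "would start DHCP on eth0"` would skip any interface whose address is not permanent. Whether that is strict enough for every virtual device type is exactly the open question in the comments above.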
What we were seeing is an ever-increasing number of failed systemd resources over time as networks were created and deleted. The output of "systemctl" was an ever-increasing list that included netns-related services, all of them failed (which may or may not be important), but I suspect this somehow impacted pacemaker as well.

*** Bug 1299083 has been marked as a duplicate of this bug. ***

I reproduced the bug on OSP 7.1. The failed resources are present on compute and controller nodes (where interfaces are created and deleted). The failed resources appear when neutron deletes the port (the dhcp-interface services are never deleted). On our platform, when the number of failed resources goes over 10k, systemd starts to become unusable and the only thing to do is a power cycle. To clean up an existing system with failed resources we used "systemctl reset-failed" (it worked if there were fewer than 10k failed resources). I just applied Dan's patch; we no longer get any dhcp-interface services created, so there are no more failed services.

This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.

Verified with: openstack-tripleo-image-elements-0.9.9-2.el7ost.noarch

Code merged to: /usr/share/tripleo-image-elements/os-net-config/os-refresh-config/configure.d/20-os-net-config

Steps followed: launch guests over time and look for the following on the controllers:

systemctl | grep dhcp | grep fail | wc -l

If the number goes up, you will likely be seeing this issue.

Results:

Every 2.0s: systemctl | grep dhcp | grep fail | wc -l    Tue Apr 19 15:52:27 2016
2

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0653.html |
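The verification check quoted in the thread can be wrapped in a small helper for repeated monitoring. This is a rough sketch, not part of the fix: the function names and THRESHOLD are hypothetical, and the default alert level is an arbitrary illustration rather than a value from the report. The functions read `systemctl` output from standard input, so they reuse the same grep pipeline the thread shows.

```shell
#!/bin/sh
# Sketch only: count failed dhcp units in `systemctl` output read from
# stdin, mirroring the pipeline quoted in the thread.
count_failed_dhcp() {
    grep dhcp | grep fail | wc -l | tr -d ' '
}

# THRESHOLD is a hypothetical alert level, not a number from the report.
THRESHOLD="${THRESHOLD:-100}"

# Prints the count and succeeds only while it stays below THRESHOLD.
check_failed_dhcp() {
    count="$(count_failed_dhcp)"
    echo "failed dhcp units: $count"
    [ "$count" -lt "$THRESHOLD" ]
}
```

On a controller this might be run as `systemctl | check_failed_dhcp`. If the check fails, the cleanup reported in the thread was "systemctl reset-failed", which according to that comment only worked while the failed count was still below roughly 10k.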