Bug 1503347
Summary: | OCP 3.7 Atomic System Containers cluster NetworkManager CPU Util 60% after 400+ hellopods deployed on compute node | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Walid A. <wabouham> | ||||
Component: | atomic | Assignee: | Colin Walters <walters> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | atomic-bugs <atomic-bugs> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.4 | CC: | aos-bugs, bbaude, danw, ddarrah, dornelas, dwalsh, fkluknav, jamills, mifiedle, nbhatt, smilner, wabouham, walters, ypu | ||||
Target Milestone: | rc | Keywords: | Extras | ||||
Target Release: | 7.5 | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | aos-scalability-37 | ||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2018-05-01 21:58:03 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1186913, 1494728, 1513780 | ||||||
Attachments: |
Description
Walid A.
2017-10-17 22:24:10 UTC
> we are seeing NetworkManager process utilize 60% of a CPU core on m4.xlarge EC2 compute nodes

> Additionally there were 14023 instances of separate dhclient processes starting and stopping during the test on the compute node.

So something is really screwed up here; NetworkManager should be ignoring veth devices, but the logs show that it's treating them like ordinary ethernet devices, and trying to run dhclient on them, etc.

Normally, the rules in /usr/lib/udev/rules.d/85-nm-unmanaged.rules should cause all veth devices to be marked NM_UNMANAGED=1:

    $ sudo ip link add dev vethtest-a type veth peer name vethtest-b
    $ udevadm info /sys/devices/virtual/net/vethtest-a
    P: /devices/virtual/net/vethtest-a
    E: DEVPATH=/devices/virtual/net/vethtest-a
    E: ID_MM_CANDIDATE=1
    E: ID_NET_DRIVER=veth
    E: ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
    E: IFINDEX=262
    E: INTERFACE=vethtest-a
 -> E: NM_UNMANAGED=1
    E: SUBSYSTEM=net
    E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/vethtest-a
    E: TAGS=:systemd:
    E: USEC_INITIALIZED=648636096211

Thomas Haller on the NM team suggests this may be bug 1462741; if the 80-net-setup-link.rules udev rule is disabled (which apparently might have been done at one point in Atomic Host?) then ID_NET_DRIVER would not be initialized correctly for devices (eg in the example above you wouldn't see "E: ID_NET_DRIVER=veth") and the NM rules wouldn't be applied properly. And the fix, according to that bug, is "DON'T DO THAT THEN!" because 80-net-setup-link.rules is kind of important. (Thomas says he's pretty sure a bug was filed against Atomic Host but he can't find it.)

So can you confirm that /usr/lib/udev/rules.d/80-net-setup-link.rules is a symlink to /dev/null, and ethtool is not installed? And if so, can you try installing ethtool, rebooting, and seeing if that fixes things?

Is there anything weird about how you installed this system? No one else has run into this problem as far as I know.
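The check described above comes down to looking for two properties in the `udevadm info` output. A minimal sketch of that check follows; `check_veth_props` is a hypothetical helper name, not part of udev or NetworkManager, and it assumes the plain `E: KEY=VALUE` lines shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration): reads
# `udevadm info` output on stdin and reports the two properties
# discussed in this bug, ID_NET_DRIVER and NM_UNMANAGED.
check_veth_props() {
    awk '
        # Remember the driver name if udev set one.
        /^E: ID_NET_DRIVER=/ { sub(/^E: ID_NET_DRIVER=/, ""); driver = $0 }
        # Note whether NetworkManager was told to leave the device alone.
        /^E: NM_UNMANAGED=1$/ { unmanaged = 1 }
        END {
            printf "driver=%s unmanaged=%s\n",
                (driver == "" ? "missing" : driver),
                (unmanaged ? "yes" : "no")
        }
    '
}

# Possible usage on an affected node (needs a real veth device):
#   udevadm info /sys/devices/virtual/net/vethtest-a | check_veth_props
```

On a healthy host this would be expected to print `driver=veth unmanaged=yes`; a host where the properties are not being set would print `driver=missing unmanaged=no`.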
Both compute nodes have been in the NotReady state since Oct 17, and I could not ssh to them today. After rebooting one compute node, I am able to ssh to it and confirm that:

1. ethtool is installed in /usr/sbin/ethtool
2. /usr/lib/udev/rules.d/80-net-setup-link.rules is NOT a symlink to /dev/null

    # ls -ltr /usr/lib/udev/rules.d/80-net-setup-link.rules
    -rw-r--r--. 2 root root 292 Jan 1 1970 /usr/lib/udev/rules.d/80-net-setup-link.rules
    #
    # cat /usr/lib/udev/rules.d/80-net-setup-link.rules
    # do not edit this file, it will be overwritten on update
    SUBSYSTEM!="net", GOTO="net_setup_link_end"
    IMPORT{builtin}="path_id"
    ACTION!="add", GOTO="net_setup_link_end"
    IMPORT{builtin}="net_setup_link"
    NAME=="", ENV{ID_NET_NAME}!="", NAME="$env{ID_NET_NAME}"
    LABEL="net_setup_link_end"
    #

---------

This environment was installed using the BYO advanced install procedure we always use to build OCP clusters on EC2, by running the playbook openshift-ansible/playbooks/byo/config.yml with the variable openshift_use_system_containers: true

On IRC Walid also confirmed using udevadm that veth devices aren't getting the NM_UNMANAGED label, or any of the ID_* labels that ought to be getting set by 80-net-setup-link.rules.

So, udev is not working as expected on this host. I don't know why. It's not an OpenShift problem though; it's just NM doing lots of work that it shouldn't be doing, because udev is broken and failing to tell it that it shouldn't be managing those interfaces.

(Reassigning to what I *think* is the right component for atomic...)
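The check above looks only at the copy under /usr/lib/udev/rules.d. udev also reads /etc/udev/rules.d, where a file overrides a same-named file under /usr/lib/udev/rules.d and a symlink to /dev/null masks the rule entirely, so both locations are worth checking. A minimal sketch, in which `rule_state` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration): classifies a
# udev rules file as masked, present, or absent.
rule_state() {
    # $1: path to a udev rules file
    if [ -L "$1" ] && [ "$(readlink "$1")" = "/dev/null" ]; then
        echo masked     # symlink to /dev/null: the rule is disabled
    elif [ -e "$1" ]; then
        echo present    # a real rules file exists at this path
    else
        echo absent     # nothing at this path
    fi
}

# Possible usage on the node under investigation:
#   for d in /etc/udev/rules.d /usr/lib/udev/rules.d; do
#       printf '%s: %s\n' "$d" "$(rule_state "$d/80-net-setup-link.rules")"
#   done
```

If the /etc copy reports `masked`, the rule is disabled even though the /usr/lib copy reports `present`.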
I can verify in Atomic Host 7.4.2 that /etc/udev/rules.d/80-net-setup-link.rules is a symlink to /dev/null.

I can also confirm that creating veth devices with "ip link add dev vethtest-a type veth peer name vethtest-b" does not give me the ID_NET_DRIVER variable:

    # udevadm info /sys/devices/virtual/net/vethtest-a
    P: /devices/virtual/net/vethtest-a
    E: DEVPATH=/devices/virtual/net/vethtest-a
    E: IFINDEX=15
    E: INTERFACE=vethtest-a
    E: SUBSYSTEM=net
    E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/vethtest-a
    E: TAGS=:systemd:
    E: USEC_INITIALIZED=507914790486

In addition, it spawns dhclient for both vethtest-a and vethtest-b.

The same holds true if I boot with biosdevname=0.

However, after enabling udevd debug, and running the same commands, I can see this:

    Oct 24 14:15:39 atomic-7.4.1 systemd-udevd[14884]: '/bin/sh -x -c 'ethtool -i $1 | sed -n s/^driver:\ //p' -- vethtest-b'(err) '--: ethtool: command not found'

If I copy the rule to /etc/udev/rules.d/, modify the command from:

    PROGRAM="/bin/sh -c 'ethtool -i $1 | sed -n s/^driver:\ //p' -- $env{INTERFACE}", RESULT=="?*", ENV{ID_NET_DRIVER}="%c"

to:

    PROGRAM="/bin/sh -c '/sbin/ethtool -i $1 | sed -n s/^driver:\ //p' -- $env{INTERFACE}", RESULT=="?*", ENV{ID_NET_DRIVER}="%c"

and run killall -HUP /usr/lib/systemd/systemd-udevd, I see what I want to see when I add the veth devices (with or without biosdevname):

    # udevadm info /sys/devices/virtual/net/vethtest-a
    P: /devices/virtual/net/vethtest-a
    E: DEVPATH=/devices/virtual/net/vethtest-a
    E: ID_NET_DRIVER=veth
    E: IFINDEX=17
    E: INTERFACE=vethtest-a
    E: NM_UNMANAGED=1
    E: SUBSYSTEM=net
    E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/vethtest-a
    E: TAGS=:systemd:
    E: UDEV_BIOSDEVNAME=0
    E: USEC_INITIALIZED=1603199719
    E: biosdevname=0

> The same holds true if I boot with biosdevname=0

Right, biosdevname=0 was suggested as something Atomic Host could do instead of masking 80-net-setup-link.rules, not as something that would undo the effect of masking it.
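The rule quoted above takes ID_NET_DRIVER from the "driver:" line of `ethtool -i`. When the shell cannot find ethtool, sed receives no input and prints nothing, so RESULT=="?*" (which needs at least one character) does not match and the variable is never set. A minimal sketch of that extraction step, using the same sed expression; `driver_from_ethtool` is a hypothetical name and the sample input is an assumed, abbreviated form of `ethtool -i` output:

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration): the same sed
# expression as in the rule, applied to `ethtool -i` output on stdin.
driver_from_ethtool() {
    sed -n 's/^driver: //p'
}

# With driver information available, the driver name is printed:
#   printf 'driver: veth\nversion: 1.0\n' | driver_from_ethtool
# With empty input (what sed sees when ethtool is not found),
# nothing is printed:
#   printf '' | driver_from_ethtool
```

This is consistent with the fix reported above: calling /sbin/ethtool by absolute path removes the dependence on the command search path used when udev runs the program.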
> Oct 24 14:15:39 atomic-7.4.1 systemd-udevd[14884]: '/bin/sh -x -c 'ethtool -i $1 | sed -n s/^driver:\ //p' -- vethtest-b'(err) '--: ethtool: command not found'

Hm... OK, so there's a bug in NetworkManager's rules then.

Given that, should we consider this bug to be entirely NM's fault (and refile this bug over to NM), or do you still want to make changes to Atomic with respect to 80-net-setup-link.rules (in which case I can reopen bug 1462741 for the NM ethtool issue)?

It should be in 7.5.0, but I don't believe there is an errata for cloud images.