Bug 1503347 - OCP 3.7 Atomic System Containers cluster: NetworkManager CPU util 60% after 400+ hello-pods deployed on compute node
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: atomic
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.5
Assignee: Colin Walters
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard: aos-scalability-37
Depends On:
Blocks: 1186913 1494728 1513780
 
Reported: 2017-10-17 22:24 UTC by Walid A.
Modified: 2021-09-09 12:43 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-01 21:58:03 UTC
Target Upstream Version:
Embargoed:


Attachments
pod logs and event logs (3.11 MB, text/plain), 2017-10-17 22:24 UTC, Walid A.

Description Walid A. 2017-10-17 22:24:10 UTC
Created attachment 1339902 [details]
pod logs and event logs

Description of problem:
This is happening on an OCP 3.7.0-0.143.2 cluster on Atomic Host with OpenShift System Containers. When we run the Node Vertical Scalability test to deploy 500 hello-openshift pods per compute node, we see the NetworkManager process utilize 60% of a CPU core on m4.xlarge EC2 compute nodes after about 350-400 hello-pods have been deployed.  openvswitch CPU utilization is 20-30%.  As we approach a density of 400 pods per node, newly deployed pods get stuck in Pending and ContainerCreating for 30-45 minutes before they go into Running state.  The compute nodes remained in NodeReady state.
On an OCP 3.7.0-0.143.0 RPM-installed RHEL cluster, NetworkManager and openvswitch combined used about 1-2% CPU on average when running the same test.

Additionally, there were 14,023 instances of separate dhclient processes starting and stopping on the compute node during the test.
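
For reference, a quick spot-check of this on a compute node can look like the following (a minimal sketch; it assumes pidstat from sysstat is available and that NetworkManager runs as a single process):

$ pgrep -c dhclient                                      # dhclient processes currently running
$ ip -o link | grep -c veth                              # veth interfaces (roughly one per pod)
$ pidstat -u -p "$(pgrep -d, -x NetworkManager)" 60 1    # NetworkManager CPU over one 60-second sample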

The cluster had one master/etcd node, one infra node, and two compute nodes, all m4.xlarge instances on EC2.

Version-Release number of selected component (if applicable):
#openshift version
openshift v3.7.0-0.143.2
kubernetes v1.7.0+80709908fd
etcd 3.2.1
# cat /etc/redhat-release
Red Hat Enterprise Linux Atomic Host release 7.4

# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-48.git0fdc778.el7.x86_64
 Go version:      go1.8.3
 Git commit:      0fdc778/1.12.6
 Built:           Thu Jul 20 00:06:39 2017
 OS/Arch:         linux/amd64
# 
docker storage driver: overlay2

                 

How reproducible:
Reproduced at least twice.

Steps to Reproduce:
1. Git clone the openshift svt repo (https://github.com/openshift/svt/)
2. Use the following node-default.yaml and quota-default.yaml files in the svt/openshift_scalability/content dir. The pod-default.json and quota-default.json templates used to deploy the hello-openshift pods can be found here:

https://gist.github.com/wabouhamad/6fe8c99a2cfe94c89dc1cb8ab85ff5ce

3. cd svt/openshift_scalability
Run cluster-loader.py (https://github.com/openshift/svt/blob/master/openshift_scalability/README.md) with config file:
https://gist.github.com/wabouhamad/d83a3dc4ef24d32e4582e0c2301591a6#file-nodevertical_1000_pods_per_node-yaml

python ./cluster-loader.py -vf config/nodeVertical_1000_pods_per_node.yaml

4. Wait until 500 hello-pods have been deployed per node.  The script will deploy 500 pods per compute node (1000 total for the cluster).
You can check CPU utilization on the compute node when 300 pods are deployed, using the pidstat or top command (see the sketch below).  We used the containerized version of the pbench tool (https://github.com/distributed-system-analysis/pbench/) to capture pidstat, iostat, and sar stats at 60-second intervals and generated graphs.  pprof data was not available with containerized pbench (see the next private comments).
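
If pbench is not available, a minimal alternative sketch for the monitoring in step 4 (run oc wherever you have cluster credentials and pidstat on the compute node; assumes sysstat is installed and the test pods have "hello" in their names):

$ oc get pods --all-namespaces --no-headers | grep -c hello      # pods deployed so far
$ pidstat -u -p "$(pgrep -d, -x NetworkManager)" 60              # NetworkManager CPU, one sample per 60s
$ pidstat -u -p "$(pgrep -d, -x ovs-vswitchd)" 60                # openvswitch CPU, run in a second shell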

Actual results:
100+ out of the 500 hello pods are stuck in Pending or ContainerCreating state for 15 to 30 minutes.  The NetworkManager and openvswitch processes on the compute node utilize 60% and 20% of a CPU core, respectively.

Expected results:
All 500 pods should go into Running state within a few minutes of being scheduled. NetworkManager and openvswitch process CPU utilization should be minimal, less than 5%.

Additional info:
pod logs and event logs are attached.
Links to journal logs from the master and compute nodes, along with pidstat data from pbench, will be provided in the next private comments.

Comment 2 Dan Winship 2017-10-18 17:32:24 UTC
> we are seeing NetworkManager process utilize 60% of a CPU core on m4.xlarge EC2 compute nodes

> Additionally there were 14023 instances of separate dhclient processes starting and stopping during the test on the compute node.

So something is really screwed up here; NetworkManager should be ignoring veth devices, but the logs show that it's treating them like ordinary ethernet devices, and trying to run dhclient on them, etc.

Normally, the rules in /usr/lib/udev/rules.d/85-nm-unmanaged.rules should cause all veth devices to be marked NM_UNMANAGED=1:

    $ sudo ip link add dev vethtest-a type veth peer name vethtest-b

    $ udevadm info /sys/devices/virtual/net/vethtest-a
    P: /devices/virtual/net/vethtest-a
    E: DEVPATH=/devices/virtual/net/vethtest-a
    E: ID_MM_CANDIDATE=1
    E: ID_NET_DRIVER=veth
    E: ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
    E: IFINDEX=262
    E: INTERFACE=vethtest-a
->  E: NM_UNMANAGED=1
    E: SUBSYSTEM=net
    E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/vethtest-a
    E: TAGS=:systemd:
    E: USEC_INITIALIZED=648636096211

Thomas Haller on the NM team suggests this may be bug 1462741; if the 80-net-setup-link.rules udev rule is disabled (which apparently might have been done at one point in Atomic Host?) then ID_NET_DRIVER would not be initialized correctly for devices (e.g. in the example above you wouldn't see "E: ID_NET_DRIVER=veth") and the NM rules wouldn't be applied properly. And the fix, according to that bug, is "DON'T DO THAT THEN!" because 80-net-setup-link.rules is kind of important.
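
(For context, the veth handling in 85-nm-unmanaged.rules depends on ID_NET_DRIVER being populated, which is why a disabled 80-net-setup-link.rules matters. A quick way to see that dependency on a host, reusing the test pair from above and cleaning it up afterwards:)

    $ grep -n veth /usr/lib/udev/rules.d/85-nm-unmanaged.rules      # look for the veth match that should set NM_UNMANAGED
    $ udevadm info /sys/devices/virtual/net/vethtest-a | grep -E 'ID_NET_DRIVER|NM_UNMANAGED'
    $ sudo ip link del dev vethtest-a                               # deleting one end removes the whole pair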

(Thomas says he's pretty sure a bug was filed against Atomic Host but he can't find it.)

Comment 4 Dan Winship 2017-10-19 18:26:49 UTC
So can you confirm that /usr/lib/udev/rules.d/80-net-setup-link.rules is a symlink to /dev/null, and ethtool is not installed? And if so, can you try installing ethtool, rebooting, and seeing if that fixes things?
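
Something along these lines should answer both questions (a sketch assuming the stock paths):

    $ ls -l /usr/lib/udev/rules.d/80-net-setup-link.rules /etc/udev/rules.d/80-net-setup-link.rules   # a "-> /dev/null" target means the rule is masked
    $ command -v ethtool || echo "ethtool is not installed"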

Is there anything weird about how you installed this system? No one else has run into this problem as far as I know.

Comment 5 Walid A. 2017-10-19 20:01:38 UTC
Both compute nodes have been in NotReady state since Oct 17, and I could not ssh to them today.  After rebooting one compute node, I am able to ssh to it and confirm that:
1.  ethtool is installed in /usr/sbin/ethtool
2.  /usr/lib/udev/rules.d/80-net-setup-link.rules is NOT a symlink to /dev/null

# ls -ltr /usr/lib/udev/rules.d/80-net-setup-link.rules
-rw-r--r--. 2 root root 292 Jan  1  1970 /usr/lib/udev/rules.d/80-net-setup-link.rules
# 

# cat /usr/lib/udev/rules.d/80-net-setup-link.rules
# do not edit this file, it will be overwritten on update

SUBSYSTEM!="net", GOTO="net_setup_link_end"

IMPORT{builtin}="path_id"

ACTION!="add", GOTO="net_setup_link_end"

IMPORT{builtin}="net_setup_link"

NAME=="", ENV{ID_NET_NAME}!="", NAME="$env{ID_NET_NAME}"

LABEL="net_setup_link_end"
# 

---------
This environment was installed using the BYO advanced install procedure we always use to build OCP clusters on EC2, by running the playbook openshift-ansible/playbooks/byo/config.yml with the variable openshift_use_system_containers: true.
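
(For the record, the invocation is roughly the following, with the inventory path elided; the variable can equally be set in the inventory itself rather than on the command line:)

# ansible-playbook -i <inventory> openshift-ansible/playbooks/byo/config.yml \
      -e openshift_use_system_containers=true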

Comment 7 Dan Winship 2017-10-19 20:19:31 UTC
On IRC Walid also confirmed using udevadm that veth devices aren't getting the NM_UNMANAGED label, or any of the ID_* labels that ought to be getting set by 80-net-setup-link.rules.

So, udev is not working as expected on this host. I don't know why. It's not an OpenShift problem though; it's just NM doing lots of work that it shouldn't be doing, because udev is broken and failing to tell it that it shouldn't be managing those interfaces.
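
(If nmcli is available on the node, a quick way to see the symptom from NM's side: the pod veths should normally be listed as "unmanaged", but here they show up as managed/connecting, which is consistent with NM running dhclient on them.)

    $ nmcli -f DEVICE,TYPE,STATE device | grep veth      # lists veth devices and whether NM manages them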

(Reassigning to what I *think* is the right component for atomic...)

Comment 11 James W. Mills 2017-10-24 14:29:45 UTC
I can verify in Atomic Host 7.4.2 that /etc/udev/rules.d/80-net-setup-link.rules is a symlink to /dev/null

I can also confirm that creating veth devices with "ip link add dev vethtest-a type veth peer name vethtest-b" does not give me the ID_NET_DRIVER variable:

# udevadm info /sys/devices/virtual/net/vethtest-a
P: /devices/virtual/net/vethtest-a
E: DEVPATH=/devices/virtual/net/vethtest-a
E: IFINDEX=15
E: INTERFACE=vethtest-a
E: SUBSYSTEM=net
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/vethtest-a
E: TAGS=:systemd:
E: USEC_INITIALIZED=507914790486


In addition, it spawns dhclient for both vethtest-a and vethtest-b.

The same holds true if I boot with biosdevname=0

However, after enabling udevd debug, and running the same commands, I can see this:

Oct 24 14:15:39 atomic-7.4.1 systemd-udevd[14884]: '/bin/sh -x -c 'ethtool -i $1 | sed -n s/^driver:\ //p' -- vethtest-b'(err) '--: ethtool: command not found'

If I copy the rule to /etc/udev/rules.d/, modify the command from:

PROGRAM="/bin/sh -c 'ethtool -i $1 | sed -n s/^driver:\ //p' -- $env{INTERFACE}", RESULT=="?*", ENV{ID_NET_DRIVER}="%c"

to:

PROGRAM="/bin/sh -c '/sbin/ethtool -i $1 | sed -n s/^driver:\ //p' -- $env{INTERFACE}", RESULT=="?*", ENV{ID_NET_DRIVER}="%c"

and killall -HUP /usr/lib/systemd/systemd-udevd

I see what I want to see when I add the veth devices (with or without biosdevname):

# udevadm info /sys/devices/virtual/net/vethtest-a
P: /devices/virtual/net/vethtest-a
E: DEVPATH=/devices/virtual/net/vethtest-a
E: ID_NET_DRIVER=veth
E: IFINDEX=17
E: INTERFACE=vethtest-a
E: NM_UNMANAGED=1
E: SUBSYSTEM=net
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/vethtest-a
E: TAGS=:systemd:
E: UDEV_BIOSDEVNAME=0
E: USEC_INITIALIZED=1603199719
E: biosdevname=0
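
To consolidate the workaround above (this assumes the rule being copied is NetworkManager's 85-nm-unmanaged.rules, which is where that PROGRAM line lives):

# cp /usr/lib/udev/rules.d/85-nm-unmanaged.rules /etc/udev/rules.d/
# sed -i "s|'ethtool |'/sbin/ethtool |" /etc/udev/rules.d/85-nm-unmanaged.rules
# udevadm control --reload-rules        # or HUP systemd-udevd, as above

The copy in /etc/udev/rules.d/ overrides the file of the same name shipped in /usr/lib/udev/rules.d/, so the original stays untouched across updates.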

Comment 12 Dan Winship 2017-10-24 15:43:45 UTC
> The same holds true if I boot with biosdevname=0

Right, biosdevname=0 was suggested as something Atomic Host could do instead of masking 80-net-setup-link.rules, not as something that would undo the effect of masking it.

> Oct 24 14:15:39 atomic-7.4.1 systemd-udevd[14884]: '/bin/sh -x -c 'ethtool -i $1 | sed -n s/^driver:\ //p' -- vethtest-b'(err) '--: ethtool: command not found'

Hm... OK, so there's a bug in NetworkManager's rules then. Given that, should we consider this bug to be entirely NM's fault (and refile it over to NM), or do you still want to make changes to Atomic with respect to 80-net-setup-link.rules (in which case I can reopen bug 1462741 for the NM ethtool issue)?

Comment 23 Colin Walters 2018-04-02 13:13:23 UTC
It should be in 7.5.0, but I don't believe there is an errata for cloud images.

