Bug 1920481

Summary: kuryr-cni pods using unreasonable amount of CPU
Product: OpenShift Container Platform Reporter: Michał Dulko <mdulko>
Component: NetworkingAssignee: Michał Dulko <mdulko>
Networking sub component: kuryr QA Contact: GenadiC <gcheresh>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: rheinzma, rlobillo
Version: 4.7   
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: All   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:56:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1920995    

Description Michał Dulko 2021-01-26 11:47:30 UTC
Description of problem:
kuryr-cni pods tend to use too much CPU on the nodes:

     17 root      20   0 1347416  77012   7232 S  40.6   0.5   2082:26 kuryr-daemon: master process [/usr/local/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]                          
     42 root      20   0  543320  76612   8352 S  38.9   0.5   2060:28 kuryr-daemon: Prometheus Exporter worker(0)                                                                             
     37 root      20   0  777176 117460   9872 S   0.3   0.7  12:19.91 kuryr-daemon: health worker(0)                                                                                          
      1 root      20   0  387148  83720  16348 S   0.0   0.5   0:22.80 kuryr-daemon: master process [/usr/local/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]                          
     27 root      20   0  475564  78412   8584 S   0.0   0.5  27:15.51 kuryr-daemon: watcher worker(0)                                                                                         
     33 root      20   0  395352  73372   5804 S   0.0   0.4   1:03.99 kuryr-daemon: server worker(0)                                                                                          
 197502 root      20   0   12024   3348   2804 S   0.0   0.0   0:00.03 bash                    
 197587 root      20   0   51216   4032   3372 R   0.0   0.0   0:00.14 top -c                                                                                                                  



Version-Release number of selected component (if applicable):


How reproducible:
Always?

Steps to Reproduce:
1. Run installation.
2. Run `top` inside kuryr-cni pods.

Actual results:
Up to 40% usage in some spikes.

Expected results:
On a idle cluster usage should be negligible.

Additional info:

Comment 2 rlobillo 2021-01-27 11:19:47 UTC
Verified on 4.7.0-0.nightly-2021-01-27-055755 on OSP16.1 (RHOS-16.1-RHEL-8-20201214.n.3) with OVN-Octavia. UPI installation.

Without the fix (4.7.0-0.nightly-2021-01-26-082025),Prometheus Exporter worker is consuming >30% of CPU on kuryr-cni for master-0 and worker-0:

$ oc get pods -n openshift-kuryr -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                    NOMINATED NODE   READINESS GATES
kuryr-cni-4x7xm                     1/1     Running   0          18h   10.196.2.151   ostest-kvc8v-worker-2   <none>           <none>
kuryr-cni-65rvl                     1/1     Running   0          18h   10.196.1.228   ostest-kvc8v-worker-0   <none>           <none>
kuryr-cni-9twv5                     1/1     Running   0          18h   10.196.2.213   ostest-kvc8v-worker-1   <none>           <none>
kuryr-cni-jzgnv                     1/1     Running   0          18h   10.196.3.99    ostest-kvc8v-master-2   <none>           <none>
kuryr-cni-mlrd6                     1/1     Running   0          18h   10.196.0.99    ostest-kvc8v-master-0   <none>           <none>
kuryr-cni-x878q                     1/1     Running   0          18h   10.196.2.56    ostest-kvc8v-master-1   <none>           <none>
kuryr-controller-7cd7df9d65-wrh4v   1/1     Running   0          17h   10.196.0.99    ostest-kvc8v-master-0   <none>           <none>

$ oc rsh -n openshift-kuryr kuryr-cni-mlrd6
sh-4.4# top -n 1 -c -b
top - 11:00:26 up 19:51,  0 users,  load average: 3.77, 4.98, 4.87
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31.1 us, 14.8 sy,  0.0 ni, 45.9 id,  0.0 wa,  3.3 hi,  1.6 si,  3.3 st
MiB Mem :  16024.2 total,   2207.8 free,   5860.3 used,   7956.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9812.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     41 root      20   0  545144  74684   8352 S  33.3   0.5 397:01.57 kuryr-daemon: Prometheus Exporter worker(0)
     17 root      20   0  611184  74180   7444 S  26.7   0.5 402:32.24 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
      1 root      20   0  388972  79520  13864 S   0.0   0.5   0:06.52 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
     27 root      20   0  476804  76376   9004 S   0.0   0.5   3:43.45 kuryr-daemon: watcher worker(0)
     33 root      20   0  397432  71788   5912 S   0.0   0.4   0:10.50 kuryr-daemon: server worker(0)
     36 root      20   0  700920  93228  10096 S   0.0   0.6   2:22.99 kuryr-daemon: health worker(0)
  38621 root      20   0   12024   3248   2820 S   0.0   0.0   0:00.01 /bin/sh
  38633 root      20   0   51056   3876   3324 R   0.0   0.0   0:00.00 top -n 1 -c -b
sh-4.4# exit
exit

$ oc rsh -n openshift-kuryr kuryr-cni-65rvl
sh-4.4# top -n 1 -c -b
top - 11:00:48 up 19:34,  0 users,  load average: 2.33, 1.74, 1.60
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s): 21.1 us, 12.3 sy,  0.0 ni, 63.2 id,  0.0 wa,  1.8 hi,  1.8 si,  0.0 st
MiB Mem :  16024.2 total,   8338.0 free,   3722.9 used,   3963.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12596.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     17 root      20   0  536940  73464   7468 S  60.0   0.4 584:58.76 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
     42 root      20   0  545144  74776   8444 S  46.7   0.5 574:22.76 kuryr-daemon: Prometheus Exporter worker(0)
      1 root      20   0  388972  79232  13576 S   0.0   0.5   0:07.23 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
     27 root      20   0  476808  76016   8876 S   0.0   0.5   1:09.75 kuryr-daemon: watcher worker(0)
     32 root      20   0  397432  71624   5748 S   0.0   0.4   0:14.16 kuryr-daemon: server worker(0)
     37 root      20   0  700920  94764  10080 S   0.0   0.6   2:19.23 kuryr-daemon: health worker(0)
  38871 root      20   0   12024   3180   2752 S   0.0   0.0   0:00.00 /bin/sh
  38882 root      20   0   51056   3732   3176 R   0.0   0.0   0:00.00 top -n 1 -c -b


With the fix, Prometheus Exporter worker is consuming ~0.0% CPU on kuryr-cni from master-0 and worker-0:

$ oc get pods -n openshift-kuryr -o wide
NAME                                READY   STATUS    RESTARTS   AGE    IP             NODE                    NOMINATED NODE   READINESS GATES
kuryr-cni-25vdv                     1/1     Running   0          21m    10.196.2.172   ostest-6tvgk-master-0   <none>           <none>
kuryr-cni-4fpd4                     1/1     Running   0          112s   10.196.2.243   ostest-6tvgk-worker-0   <none>           <none>
kuryr-cni-9njsw                     1/1     Running   0          115s   10.196.3.116   ostest-6tvgk-worker-2   <none>           <none>
kuryr-cni-gk72k                     1/1     Running   0          21m    10.196.2.133   ostest-6tvgk-master-1   <none>           <none>
kuryr-cni-h4g6x                     1/1     Running   0          21m    10.196.1.14    ostest-6tvgk-master-2   <none>           <none>
kuryr-cni-rsfhm                     1/1     Running   0          108s   10.196.2.240   ostest-6tvgk-worker-1   <none>           <none>
kuryr-controller-7b5ccb8857-t7lj8   1/1     Running   2          21m    10.196.2.172   ostest-6tvgk-master-0   <none>           <none>

$ oc rsh -n openshift-kuryr kuryr-cni-25vdv
sh-4.4# top -n 1 -c -b
top - 10:58:08 up 25 min,  0 users,  load average: 7.01, 5.36, 4.60
Tasks:  12 total,   3 running,   7 sleeping,   0 stopped,   2 zombie
%Cpu(s): 56.7 us, 22.4 sy,  0.0 ni, 11.9 id,  0.0 wa,  4.5 hi,  4.5 si,  0.0 st
MiB Mem :  16024.2 total,   3364.7 free,   5875.0 used,   6784.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   9785.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1747 root      20   0       0      0      0 Z  23.5   0.0   0:00.53 [kuryr-daemon: s] <defunct>
      1 root      20   0  389080  79488  13832 R   0.0   0.5   0:02.72 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
     17 root      20   0  889828  74536   7572 S   0.0   0.5   0:01.83 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
     27 root      20   0  476932  75868   8684 S   0.0   0.5   0:02.50 kuryr-daemon: watcher worker(0)
     33 root      20   0  397540  71804   5904 S   0.0   0.4   0:00.61 kuryr-daemon: server worker(0)
     37 root      20   0  763492  83092  10000 S   0.0   0.5   0:03.45 kuryr-daemon: health worker(0)
     41 root      20   0  471396  71508   5432 S   0.0   0.4   0:00.64 kuryr-daemon: Prometheus Exporter worker(0)
   1748 root      20   0   12024   3172   2740 S   0.0   0.0   0:00.02 /bin/sh
   1756 root      20   0  398108  70956   4700 S   0.0   0.4   0:00.03 kuryr-daemon: server worker(0)
   1769 root      20   0       0      0      0 R   0.0   0.0   0:00.03 [kuryr-daemon: s]
   1775 root      20   0       0      0      0 Z   0.0   0.0   0:00.02 [kuryr-daemon: s] <defunct>
   1781 root      20   0   51056   3708   3156 R   0.0   0.0   0:00.00 top -n 1 -c -b
sh-4.4# exit
exit

$ oc rsh -n openshift-kuryr kuryr-cni-4fpd4
sh-4.4# top -n 1 -c -b
top - 10:58:28 up 4 min,  0 users,  load average: 3.37, 1.73, 0.71
Tasks:  18 total,   1 running,   9 sleeping,   0 stopped,   8 zombie
%Cpu(s): 48.4 us, 26.6 sy,  0.0 ni, 18.8 id,  0.0 wa,  3.1 hi,  3.1 si,  0.0 st
MiB Mem :  16024.2 total,  12376.1 free,   1186.7 used,   2461.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14483.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    161 root      20   0       0      0      0 Z   6.2   0.0   0:00.36 [kuryr-daemon: s] <defunct>
      1 root      20   0  389080  79392  13728 S   0.0   0.5   0:03.20 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
     18 root      20   0  905452  73784   7652 S   0.0   0.4   0:00.16 kuryr-daemon: master process [/usr/bin/kuryr-daemon --config-file /etc/kuryr/kuryr.conf]
     28 root      20   0  476016  77008   8628 S   0.0   0.5   0:00.19 kuryr-daemon: watcher worker(0)
     34 root      20   0  397540  71864   5964 S   0.0   0.4   0:00.11 kuryr-daemon: server worker(0)
     38 root      20   0  696932  77108   9976 S   0.0   0.5   0:00.27 kuryr-daemon: health worker(0)
     40 root      20   0  471264  73492   5524 S   0.0   0.4   0:00.07 kuryr-daemon: Prometheus Exporter worker(0)
    157 root      20   0  397540  71032   5004 S   0.0   0.4   0:00.02 kuryr-daemon: server worker(0)
    159 root      20   0       0      0      0 Z   0.0   0.0   0:00.37 [kuryr-daemon: s] <defunct>
    169 root      20   0  397540  70724   4752 S   0.0   0.4   0:00.02 kuryr-daemon: server worker(0)
    200 root      20   0   12024   3160   2736 S   0.0   0.0   0:00.02 /bin/sh
    207 root      20   0       0      0      0 Z   0.0   0.0   0:00.02 [kuryr-daemon: s] <defunct>
    219 root      20   0       0      0      0 Z   0.0   0.0   0:00.02 [kuryr-daemon: s] <defunct>
    225 root      20   0       0      0      0 Z   0.0   0.0   0:00.02 [kuryr-daemon: s] <defunct>
    250 root      20   0       0      0      0 Z   0.0   0.0   0:00.02 [kuryr-daemon: s] <defunct>
    268 root      20   0       0      0      0 Z   0.0   0.0   0:00.02 [kuryr-daemon: s] <defunct>
    274 root      20   0       0      0      0 Z   0.0   0.0   0:00.01 [kuryr-daemon: s] <defunct>
    280 root      20   0   51056   3768   3216 R   0.0   0.0   0:00.00 top -n 1 -c -b

Furthermore, prometheus reporting is working as expected generating its alerts:

$ curl -sk -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname) | .labels.alertname' 
"AlertmanagerReceiversNotConfigured"
"Watchdog"

Comment 7 errata-xmlrpc 2021-02-24 15:56:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633