Bug 1966521 - kube-proxy's userspace implementation consumes excessive CPU
Summary: kube-proxy's userspace implementation consumes excessive CPU
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
: 4.10.0
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 2002288 2063356
 
Reported: 2021-06-01 10:33 UTC by Juan Luis de Sousa-Valadas
Modified: 2022-09-06 15:52 UTC
CC: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When using service idling, the unidling proxy would become less and less efficient as more services were idled.
Consequence: In a large cluster with a large number of idle services (e.g., 1000), the SDN pod would consistently use up to 100% of one CPU.
Fix: The code for handling idle services was made more efficient.
Result: CPU usage should be lower.
Clone Of:
: 2002288 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:03:59 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System                  ID                              Private  Priority  Status  Summary  Last Updated
Github                  openshift kubernetes pull 930   0        None      None    None     2021-09-03 19:08:47 UTC
Github                  openshift sdn pull 342          0        None      None    None     2021-09-03 19:08:47 UTC
Red Hat Product Errata  RHSA-2022:0056                  0        None      None    None     2022-03-10 16:04:25 UTC

Description Juan Luis de Sousa-Valadas 2021-06-01 10:33:46 UTC
Description of problem:

The customer had over 1000 idle services, which causes kube-proxy to call iptables -C very often; this is computationally very expensive in terms of CPU.
The SDN pod was consuming well over one CPU, and deleting the idle services reduced the load to under 200m.

After using go tool pprof to generate a diagram, it's pretty obvious that a large chunk of the CPU is taken by the userspace Proxier in the function openOnePortal. Investigating it, we see:

(pprof) list openOnePortal
Total: 1.06s
ROUTINE ======================== k8s.io/kubernetes/pkg/proxy/userspace.(*Proxier).openOnePortal in k8s.io/kubernetes.6/pkg/proxy/userspace/proxier.go
         0      220ms (flat, cum) 20.75% of Total
(pprof) list EnsureRule  
Total: 1.06s
ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).EnsureRule in k8s.io/kubernetes.6/pkg/util/iptables/iptables.go
         0      190ms (flat, cum) 17.92% of Total
 Error: could not find file k8s.io/kubernetes.6/pkg/util/iptables/iptables.go on path /tmp/mozilla_jdesousa0

So it turns out that EnsureRule is very slow. Checking the code, it seems that it mostly calls checkRule and run, but virtually 100% of its time is consumed by checkRule:

(pprof) list checkRule
Total: 1.06s
ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).checkRule in k8s.io/kubernetes.6/pkg/util/iptables/iptables.go
         0      190ms (flat, cum) 17.92% of Total
ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).checkRuleUsingCheck in k8s.io/kubernetes.6/pkg/util/iptables/iptables.go
         0      190ms (flat, cum) 17.92% of Total

And this calls checkRuleUsingCheck. So basically what we're doing is calling iptables -C a lot:

// Executes the rule check using the "-C" flag
func (runner *runner) checkRuleUsingCheck(args []string) (bool, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()

    out, err := runner.runContext(ctx, opCheckRule, args)
    if ctx.Err() == context.DeadlineExceeded {
        return false, fmt.Errorf("timed out while checking rules")
    }
    if err == nil {
        return true, nil
    }
    if ee, ok := err.(utilexec.ExitError); ok {
        // iptables uses exit(1) to indicate a failure of the operation,
        // as compared to a malformed commandline, for example.
        if ee.Exited() && ee.ExitStatus() == 1 {
            return false, nil
        }
    }
    return false, fmt.Errorf("error checking rule: %v: %s", err, out)
}
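
To make the cost concrete: every EnsureRule call is a check-then-append, so it forks at least one iptables process per rule, and the userspace proxier issues several such calls per portal per sync. The following is only an illustrative, self-contained sketch of that pattern (it is not the upstream util/iptables code; the chain name and ports are made-up examples):

package main

import (
    "fmt"
    "os/exec"
)

// ensureRule mimics the check-then-append pattern: test for the rule with
// "iptables -C" and append it with "iptables -A" only if the check fails.
// With ~1000 idle services and a few portal rules each, this adds up to
// thousands of execs per sync loop.
func ensureRule(table, chain string, ruleArgs ...string) (bool, error) {
    checkArgs := append([]string{"-t", table, "-C", chain}, ruleArgs...)
    err := exec.Command("iptables", checkArgs...).Run()
    if err == nil {
        return true, nil // rule already present
    }
    if ee, ok := err.(*exec.ExitError); !ok || ee.ExitCode() != 1 {
        // iptables exits 1 for "rule not found"; anything else is a real error
        return false, fmt.Errorf("error checking rule: %v", err)
    }
    appendArgs := append([]string{"-t", table, "-A", chain}, ruleArgs...)
    if out, err := exec.Command("iptables", appendArgs...).CombinedOutput(); err != nil {
        return false, fmt.Errorf("error appending rule: %v: %s", err, out)
    }
    return false, nil
}

func main() {
    // Hypothetical portal rule, for illustration only.
    existed, err := ensureRule("nat", "KUBE-PORTALS-CONTAINER",
        "-p", "tcp", "--dport", "27017", "-j", "REDIRECT", "--to-ports", "10250")
    fmt.Println(existed, err)
}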

Version-Release number of selected component (if applicable):
This was done on OCP 4.6, but I don't believe there are significant changes in Kubernetes 1.21.

How reproducible:
Always

Steps to Reproduce:
1. Create 1000 idle services.

Actual results:
Excessive CPU usage

Expected results:
The userspace proxy should do something smarter than checking each rule one by one with iptables -C. Perhaps something like what we do in the iptables proxy.
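
For comparison, the iptables proxy avoids per-rule checks by building the desired rule set in memory and applying it in a single iptables-restore invocation. A rough, self-contained sketch of that batching idea (this is not the actual proxier code; the chain and rules below are placeholders):

package main

import (
    "bytes"
    "fmt"
    "os/exec"
)

// restoreRules applies a whole set of nat rules in one iptables-restore exec,
// instead of one "iptables -C"/"iptables -A" pair per rule.
func restoreRules(rules []string) error {
    var buf bytes.Buffer
    buf.WriteString("*nat\n")
    for _, r := range rules {
        buf.WriteString(r + "\n")
    }
    buf.WriteString("COMMIT\n")

    // --noflush: apply only the input below instead of flushing the whole table.
    cmd := exec.Command("iptables-restore", "--noflush")
    cmd.Stdin = &buf
    if out, err := cmd.CombinedOutput(); err != nil {
        return fmt.Errorf("iptables-restore failed: %v: %s", err, out)
    }
    return nil
}

func main() {
    rules := []string{
        ":EXAMPLE-PORTALS - [0:0]",
        "-A EXAMPLE-PORTALS -p tcp --dport 27017 -j REDIRECT --to-ports 10250",
    }
    fmt.Println(restoreRules(rules))
}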

Additional info:

There is currently no customer impact because the customer managed to delete most of their idle services.

Comment 1 Alexander Constantinescu 2021-06-04 08:01:02 UTC
Hi

Thanks for the detailed analysis Juan!

We have to close this, however, for the following reasons:

- There is a workaround (delete a big part of your idled services)
- We don't have enough resources to improve this right now


/Alex

Comment 9 Dan Winship 2021-08-19 17:51:31 UTC
We're going to investigate and plan to fix it, but given that we don't know what the problem is at this point, we can't commit to a timeline for having the fix yet.

Comment 11 Dan Winship 2021-08-23 18:34:30 UTC
I'm not able to reproduce this; I get CPU usage spikes when idling or unidling a lot of services, but it's kubelet using the CPU (processing all the pod deletions/creations), and usage returns to normal after the wave of idlings/unidlings is complete.

I see in the customer's data:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 362789 root      20   0  479540 430148   2644 R  97.3   1.3   3900:26 iptables-restor

That is... very weird. An iptables-restore command that has been running for 2.7 days, and is using 97.3% CPU. Something has gone really wrong there; it must be hitting some iptables-restore bug. I'm going to hand this bug over to the team that maintains iptables because I'm not sure exactly what sort of debugging info they'd need.

Comment 16 Phil Sutter 2021-08-31 12:45:42 UTC
Hi!

Sorry for the delay, I was out for two weeks.

Do we know the exact iptables RPM version this is happening with?

The (seemingly) busy-looping iptables-restore call is weird, indeed. I am not
aware of any bugs causing this, though that doesn't mean there is none.

Ideally we would use strace/gdb to check what the process is doing, though I
fear the setup is no longer available in that state, right?

If it's really that some situation causes kube-proxy to call 'iptables -C' too
often, a good solution might be to cache the output of 'iptables -S' and search
for lines matching the rule in question in there. 
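
A minimal sketch of that caching idea, for illustration only (it shells out via os/exec rather than the k8s iptables wrapper, and a real implementation would need to normalize rule arguments, since 'iptables -S' may print them in a different form than they were passed to '-C'):

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

type ruleCache struct {
    lines map[string]bool
}

// refresh runs "iptables -t <table> -S" once and indexes every rule line.
func (c *ruleCache) refresh(table string) error {
    out, err := exec.Command("iptables", "-t", table, "-S").Output()
    if err != nil {
        return fmt.Errorf("iptables -S failed: %v", err)
    }
    c.lines = make(map[string]bool)
    for _, line := range strings.Split(string(out), "\n") {
        c.lines[strings.TrimSpace(line)] = true
    }
    return nil
}

// hasRule answers an existence check from the cached dump instead of
// forking "iptables -C" for every rule.
func (c *ruleCache) hasRule(chain string, args ...string) bool {
    return c.lines["-A "+chain+" "+strings.Join(args, " ")]
}

func main() {
    cache := &ruleCache{}
    if err := cache.refresh("nat"); err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(cache.hasRule("POSTROUTING", "-j", "MASQUERADE"))
}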

Thanks, Phil

Comment 17 Dan Winship 2021-08-31 13:36:15 UTC
(In reply to Phil Sutter from comment #16)
> Do we know the exact iptables RPM version this is happening with?

RHCOS in 4.7 should have iptables-1.8.4-17.el8

> Ideally we would use strace/gdb to check what the process is doing, though I
> fear the setup is no longer available in that state, right?

it seems to happen reliably for them, so they should be able to get debugging info

> If it's really that some situation causes kube-proxy to call 'iptables -C'
> too often

no, that was a total red herring. It has nothing to do with that.

Comment 20 Jochen Schneider 2021-08-31 14:38:56 UTC
It's actually iptables-1.8.4-15.el8_3.3.x86_64 and yes, we could probably make iptables-restore go nuts again.
Any suggestions for strace/gdb options that might help you?

KR,
 Jochen

Comment 21 Phil Sutter 2021-08-31 14:52:10 UTC
Hi Jochen,

(In reply to Jochen Schneider from comment #20)
> It's actually iptables-1.8.4-15.el8_3.3.x86_64 and yes, we could probably
> make iptables-restore 
> go nuts again.
> Any suggestions for strace/gdb options that might help you?

'strace -ffp <PID>' should suffice. Feel free to ping me via email in case there's a "kurzer Dienstweg" (a quick, informal channel). :)

Cheers, Phil

Comment 40 zhaozhanqi 2021-09-08 10:07:18 UTC
Tested with the fix PR on build 4.10.0-0.ci-2021-09-07-215741.

Steps:

1. oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json
2. $ cat service.yaml 
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "test-service"
                },
                "name": "NAME"
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],
                "selector": {
                    "name": "test-pods"
                }
            }
        }


3. for i in {1..1000}; do sed s/NAME/test-$i/ < service.yaml | oc create -f - ;oc annotate endpoints test-$i "idling.alpha.openshift.io/unidle-targets"='[{"kind":"ReplicationController","name":"test-rc","replicas":2}]'; oc annotate endpoints test-$i "idling.alpha.openshift.io/idled-at"="2021-09-01T09:16:49Z"; done


While the services are being created, the max CPU usage for the openshift-sdn pod is 42%, which is better than the old version (almost 100% with the same steps).

@danw Do you think the test result is OK?

See the picture of CPU usage for openshift-sdn.

Comment 44 zhaozhanqi 2021-09-09 01:18:52 UTC
OK, thanks Dan. Moving this to VERIFIED.

Comment 51 errata-xmlrpc 2022-03-10 16:03:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 52 Akash Dubey 2022-09-02 20:40:49 UTC
Hi Team,

One of our customers is seeing this alert on 4.10.16:
FIRING | NodeProxyApplySlow | prd-rcdn-01 | prd-rcdn-01-hzgq7-default-router-1b-9gctk | openshift-sdn | RUNBOOK URL

And I can see that on the node for which the alert is being fired, the SDN pod's SyncProxyRules takes somewhere between 8 and 13 seconds:

$ oc logs sdn-cfzbl -c sdn|grep SyncProxy
2022-08-26T13:23:29.785867027Z I0826 13:23:29.785801    4151 proxier.go:806] "SyncProxyRules complete" elapsed="13.407328435s"
2022-08-26T13:24:43.571436898Z I0826 13:24:43.571381    4151 proxier.go:806] "SyncProxyRules complete" elapsed="8.378959381s"
2022-08-26T14:18:29.905021314Z I0826 14:18:29.904961    4151 proxier.go:806] "SyncProxyRules complete" elapsed="11.638542409s"
2022-08-26T14:19:26.684130098Z I0826 14:19:26.684103    4151 proxier.go:806] "SyncProxyRules complete" elapsed="12.098231796s"
2022-08-26T14:19:55.885757510Z I0826 14:19:55.885735    4151 proxier.go:806] "SyncProxyRules complete" elapsed="9.681342889s"
2022-08-26T14:20:06.072115013Z I0826 14:20:06.072060    4151 proxier.go:806] "SyncProxyRules complete" elapsed="10.185406851s"


This is way too long compared to the other SDN pod(s):

$ oc logs sdn-znvqh -c sdn|grep SyncProxy
2022-08-26T14:21:56.235750182Z I0826 14:21:56.235654    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.745043238s"
2022-08-26T14:22:05.575696511Z I0826 14:22:05.575559    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.659197681s"
2022-08-26T14:22:10.461810522Z I0826 14:22:10.461730    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.719790939s"
2022-08-26T14:22:13.241850132Z I0826 14:22:13.241791    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.779103856s"
2022-08-26T14:22:15.989606272Z I0826 14:22:15.989533    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.746797095s"
2022-08-26T14:22:18.779242498Z I0826 14:22:18.779141    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.78870992s"
2022-08-26T14:22:43.244129256Z I0826 14:22:43.244059    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.898532213s"
2022-08-26T14:23:02.251277259Z I0826 14:23:02.251002    2910 proxier.go:806] "SyncProxyRules complete" elapsed="3.565183211s"
2022-08-26T14:23:05.516604601Z I0826 14:23:05.516527    2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.816440811s"

Could you please have a look?

Please let me know if we need to raise a new Bug for this.

Looking forward to a response.

Comment 53 Dan Winship 2022-09-06 15:52:15 UTC
> Please let me know if we need to raise a new Bug for this.

This bug is already CLOSED, so yes, you need a new bug if you determine that the customer has a bug to be fixed.

