Bug 1966521 - kube-proxy's userspace implementation consumes excessive CPU
Summary: kube-proxy's userspace implementation consumes excessive CPU
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 2002288
 
Reported: 2021-06-01 10:33 UTC by Juan Luis de Sousa-Valadas
Modified: 2022-01-28 23:23 UTC
18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When using service idling, the unidling proxy would become less and less efficient as more services were idled.
Consequence: In a large cluster with a large number of idle services (e.g., 1000), the SDN pod would consistently use up to 100% of 1 CPU.
Fix: The code for handling idle services was made more efficient.
Result: CPU usage should be lower.
Clone Of:
: 2002288
Environment:
Last Closed: 2021-06-04 08:01:02 UTC
Target Upstream Version:


Attachments


Links
Github openshift/kubernetes pull 930 (Last Updated: 2021-09-03 19:08:47 UTC)
Github openshift/sdn pull 342 (Last Updated: 2021-09-03 19:08:47 UTC)

Description Juan Luis de Sousa-Valadas 2021-06-01 10:33:46 UTC
Description of problem:

Customer had over 1000 idle services, which causes kube-proxy to call iptables -C very often; this is computationally extremely expensive in terms of CPU.
SDN was consuming well over one CPU, and deleting the idle services reduced the load to under 200m.

After using go tool pprof to generate a diagram, it's pretty obvious that a large chunk of the CPU is taken by the userspace Proxier in the function openOnePortal. Investigating it, we see:

(pprof) list openOnePortal
Total: 1.06s
ROUTINE ======================== k8s.io/kubernetes/pkg/proxy/userspace.(*Proxier).openOnePortal in k8s.io/kubernetes@v1.18.6/pkg/proxy/userspace/proxier.go
         0      220ms (flat, cum) 20.75% of Total
(pprof) list EnsureRule  
Total: 1.06s
ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).EnsureRule in k8s.io/kubernetes@v1.18.6/pkg/util/iptables/iptables.go
         0      190ms (flat, cum) 17.92% of Total
 Error: could not find file k8s.io/kubernetes@v1.18.6/pkg/util/iptables/iptables.go on path /tmp/mozilla_jdesousa0

So it turns out that EnsureRule is very slow. Checking the code, it seems that it mostly calls checkRule and run, but virtually 100% of its time is consumed by checkRule:

(pprof) list checkRule
Total: 1.06s
ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).checkRule in k8s.io/kubernetes@v1.18.6/pkg/util/iptables/iptables.go
         0      190ms (flat, cum) 17.92% of Total
ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).checkRuleUsingCheck in k8s.io/kubernetes@v1.18.6/pkg/util/iptables/iptables.go
         0      190ms (flat, cum) 17.92% of Total

And this calls checkRuleUsingCheck. So basically what we're doing is calling iptables -C a lot:

// Executes the rule check using the "-C" flag
func (runner *runner) checkRuleUsingCheck(args []string) (bool, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()

    out, err := runner.runContext(ctx, opCheckRule, args)
    if ctx.Err() == context.DeadlineExceeded {
        return false, fmt.Errorf("timed out while checking rules")
    }
    if err == nil {
        return true, nil
    }
    if ee, ok := err.(utilexec.ExitError); ok {
        // iptables uses exit(1) to indicate a failure of the operation,
        // as compared to a malformed commandline, for example.
        if ee.Exited() && ee.ExitStatus() == 1 {
            return false, nil
        }
    }
    return false, fmt.Errorf("error checking rule: %v: %s", err, out)
}
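
To make the cost concrete, here is a minimal sketch (my own illustration, not the kube-proxy source) of what a single EnsureRule call amounts to in the userspace proxier; the function name, chain name, and rule arguments below are representative examples only:

package sketch

import (
    utiliptables "k8s.io/kubernetes/pkg/util/iptables"
)

// ensureOnePortalRule (hypothetical) shows the shape of what openOnePortal
// does for one service portal: each EnsureRule call execs `iptables -C`
// (checkRuleUsingCheck above) and, if the rule is absent, `iptables -A` too.
// With ~1000 idled services and several rules per portal, a single sync loop
// becomes thousands of fork/execs of the iptables binary.
func ensureOnePortalRule(ipt utiliptables.Interface) error {
    _, err := ipt.EnsureRule(
        utiliptables.Append,
        utiliptables.TableNAT,
        utiliptables.Chain("KUBE-PORTALS-CONTAINER"),
        "-p", "tcp",
        "-d", "172.30.0.10/32",
        "--dport", "27017",
        "-j", "REDIRECT",
        "--to-ports", "10250",
    )
    return err
}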

Version-Release number of selected component (if applicable):
This was done on OCP 4.6, but I don't believe there are significant changes in Kubernetes 1.21.

How reproducible:
Always

Steps to Reproduce:
1. Create 1000 idle services.

Actual results:
Excessive CPU usage

Expected results:
The userspace proxy should do something smarter than checking each rule one by one with iptables -C. Perhaps something like what we do in the iptables proxy.
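
As a rough illustration of that approach (a sketch only, not the fix that eventually merged): the iptables proxy builds the whole ruleset for a sync into one buffer and applies it with a single iptables-restore exec via the utiliptables RestoreAll helper, so the number of forked processes stops scaling with the number of services. The syncPortalRules function and the rule contents here are hypothetical:

package sketch

import (
    "bytes"
    "fmt"
    "strings"

    utiliptables "k8s.io/kubernetes/pkg/util/iptables"
)

// syncPortalRules (hypothetical) collects every portal rule for a sync into
// one iptables-restore payload and applies it with a single exec, instead of
// an `iptables -C`/`iptables -A` pair per rule. In practice the proxier would
// also emit ":CHAIN" lines for the chains it owns.
func syncPortalRules(ipt utiliptables.Interface, rules [][]string) error {
    buf := &bytes.Buffer{}
    fmt.Fprintln(buf, "*nat")
    for _, rule := range rules {
        // Each entry is a full rule spec in restore format, e.g.
        // []string{"-A", "KUBE-PORTALS-CONTAINER", "-d", "172.30.0.10/32", ...}
        fmt.Fprintln(buf, strings.Join(rule, " "))
    }
    fmt.Fprintln(buf, "COMMIT")

    // One iptables-restore exec no matter how many idled services contributed
    // rules; NoFlushTables passes --noflush so unrelated rules are untouched.
    return ipt.RestoreAll(buf.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)
}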

Additional info:

There is currently no customer impact because the customer managed to delete most of their idle services.

Comment 1 Alexander Constantinescu 2021-06-04 08:01:02 UTC
Hi

Thanks for the detailed analysis Juan!

We have to close this, however, for the following reasons:

- There is a work-around (delete a big part of your idled services)
- We don't have enough resources to improve this right now


/Alex

Comment 9 Dan Winship 2021-08-19 17:51:31 UTC
We're going to investigate and plan to fix it, but given that we don't know what the problem is at this point, we can't commit to a timeline for having the fix yet.

Comment 11 Dan Winship 2021-08-23 18:34:30 UTC
I'm not able to reproduce this; I get CPU usage spikes when idling or unidling a lot of services, but it's kubelet using the CPU (processing all the pod deletions/creations), and usage returns to normal after the wave of idlings/unidlings is complete.

I see in the customer's data:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 362789 root      20   0  479540 430148   2644 R  97.3   1.3   3900:26 iptables-restor

That is... very weird. An iptables-restore command that has been running for 2.7 days, and is using 97.3% CPU. Something has gone really wrong there; it must be hitting some iptables-restore bug. I'm going to hand this bug over to the team that maintains iptables because I'm not sure exactly what sort of debugging info they'd need.

Comment 16 Phil Sutter 2021-08-31 12:45:42 UTC
Hi!

Sorry for the delay, I was out for two weeks.

Do we know the exact iptables RPM version this is happening with?

The (seemingly) busy-looping iptables-restore call is weird, indeed. I am not
aware of any bugs causing this, though that doesn't mean there is none.

Ideally we would use strace/gdb to check what the process is doing, though I
fear the setup is no longer available in that state, right?

If it's really that some situation causes kube-proxy to call 'iptables -C' too
often, a good solution might be to cache the output of 'iptables -S' and search
for lines matching the rule in question in there. 
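
As a rough illustration of that caching idea (my sketch, not Phil's code and not what was ultimately implemented): dump the chain once with 'iptables -S', index the lines, and answer existence checks from the cache. Note that iptables normalizes rules when printing them (address masks, match order), so a real implementation would need to normalize both sides rather than compare raw strings:

package sketch

import (
    "os/exec"
    "strings"
)

// ruleCache indexes one `iptables -t <table> -S <chain>` listing so that rule
// existence checks don't fork a new iptables process each time.
type ruleCache struct {
    rules map[string]bool
}

// newRuleCache forks iptables once per table/chain per sync.
func newRuleCache(table, chain string) (*ruleCache, error) {
    out, err := exec.Command("iptables", "-t", table, "-S", chain).Output()
    if err != nil {
        return nil, err
    }
    c := &ruleCache{rules: make(map[string]bool)}
    for _, line := range strings.Split(string(out), "\n") {
        c.rules[strings.TrimSpace(line)] = true
    }
    return c, nil
}

// has reports whether a rule spec (in the "-A CHAIN ..." form that -S prints)
// is already present, without spawning another iptables process.
func (c *ruleCache) has(ruleSpec string) bool {
    return c.rules[strings.TrimSpace(ruleSpec)]
}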

Thanks, Phil

Comment 17 Dan Winship 2021-08-31 13:36:15 UTC
(In reply to Phil Sutter from comment #16)
> Do we know the exact iptables RPM version this is happening with?

RHCOS in 4.7 should have iptables-1.8.4-17.el8

> Ideally we would use strace/gdb to check what the process is doing, though I
> fear the setup is no longer available in that state, right?

it seems to happen reliably for them, so they should be able to get debugging info

> If it's really that some situation causes kube-proxy to call 'iptables -C'
> too often

no, that was a total red herring. It has nothing to do with that.

Comment 20 Jochen Schneider 2021-08-31 14:38:56 UTC
It's actually iptables-1.8.4-15.el8_3.3.x86_64 and yes, we could probably make iptables-restore go nuts again.
Any suggestions for strace/gdb options that might help you?

KR,
 Jochen

Comment 21 Phil Sutter 2021-08-31 14:52:10 UTC
Hi Jochen,

(In reply to Jochen Schneider from comment #20)
> It's actually iptables-1.8.4-15.el8_3.3.x86_64 and yes, we could probably
> make iptables-restore 
> go nuts again.
> Any suggestions for strace/gdb options that might help you?

'strace -ffp <PID>' should suffice. Feel free to ping me via email in case there's a "kurzer Dienstweg" (an informal shortcut). :)

Cheers, Phil

Comment 40 zhaozhanqi 2021-09-08 10:07:18 UTC
Tested with the fix PR on build 4.10.0-0.ci-2021-09-07-215741.

Steps:

1. oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json
2. $ cat service.yaml 
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "test-service"
                },
                "name": "NAME"
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],
                "selector": {
                    "name": "test-pods"
                }
            }
        }


3. for i in {1..1000}; do sed s/NAME/test-$i/ < service.yaml | oc create -f - ;oc annotate endpoints test-$i "idling.alpha.openshift.io/unidle-targets"='[{"kind":"ReplicationController","name":"test-rc","replicas":2}]'; oc annotate endpoints test-$i "idling.alpha.openshift.io/idled-at"="2021-09-01T09:16:49Z"; done


During service creation, the max CPU usage for the openshift-sdn pod was 42%, which is better than the old version (almost 100% with the same steps).

@danw@redhat.com Do you think the test result is ok?

see the picture of cpu usage for openshift-sdn

Comment 44 zhaozhanqi 2021-09-09 01:18:52 UTC
OK, thanks Dan. Moving this to verified.

