Description of problem: Customer had over 1000 idle services and this causes kube-proxy to call iptables -C very often which is computationally extremely expensive in terms of CPU. SDN was consuming well over one cpu and by deleting idle services reduced the load to under 200m. After using go tool pprof to generate a diagram it's pretty obvious that a large chuck of the CPU is taken by the userpsace Proxier in the function openOnePortal, investigating it we see: (pprof) list openOnePortal Total: 1.06s ROUTINE ======================== k8s.io/kubernetes/pkg/proxy/userspace.(*Proxier).openOnePortal in k8s.io/kubernetes.6/pkg/proxy/userspace/proxier.go 0 220ms (flat, cum) 20.75% of Total (pprof) list EnsureRule Total: 1.06s ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).EnsureRule in k8s.io/kubernetes.6/pkg/util/iptables/iptables.go 0 190ms (flat, cum) 17.92% of Total Error: could not find file k8s.io/kubernetes.6/pkg/util/iptables/iptables.go on path /tmp/mozilla_jdesousa0 So turns out that EnsureRule is very slow, checking the code it seems that it callss mostly checkRule and run, but virtually 100% of its time is consumed by checkRule (pprof) list checkRule Total: 1.06s ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).checkRule in k8s.io/kubernetes.6/pkg/util/iptables/iptables.go 0 190ms (flat, cum) 17.92% of Total ROUTINE ======================== k8s.io/kubernetes/pkg/util/iptables.(*runner).checkRuleUsingCheck in k8s.io/kubernetes.6/pkg/util/iptables/iptables.go 0 190ms (flat, cum) 17.92% of Total And this calls checkRuleUsingCheck. So basically what we're doing is callign iptables -C A lot: // Executes the rule check using the "-C" flag func (runner *runner) checkRuleUsingCheck(args []string) (bool, error) { ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute) defer cancel() out, err := runner.runContext(ctx, opCheckRule, args) if ctx.Err() == context.DeadlineExceeded { return false, fmt.Errorf("timed out while checking rules") } if err == nil { return true, nil } if ee, ok := err.(utilexec.ExitError); ok { // iptables uses exit(1) to indicate a failure of the operation, // as compared to a malformed commandline, for example. if ee.Exited() && ee.ExitStatus() == 1 { return false, nil } } return false, fmt.Errorf("error checking rule: %v: %s", err, out) } Version-Release number of selected component (if applicable): This was done on OCP 4.6 but I don't believe there are significant chantes in kubernetes 1.21 How reproducible: Always Steps to Reproduce: 1. Create 1000 idle services. Actual results: Excessive CPU usage Expected results: Userspace proxy should do something smarter rather than checking each rule one by one by doing iptables -C. Perhaps something like what we do in the iptables proxy. Additional info: There is no customer impact currently impact because the customer managed to delete most of their idle services.
Hi Thanks for the detailed analysis Juan! We have to close this however, these are the reasons: - There is a work-around (delete a big part of your idled services) - We don't have enough resources to improve this right now /Alex
We're going to investigate and plan to fix it, but given that we don't know what the problem is at this point, we can't commit to a timeline for having the fix yet.
I'm not able to reproduce this; I get CPU usage spikes when idling or unidling a lot of services, but it's kubelet using the CPU (processing all the pod deletions/creations), and usage returns to normal after the wave of idlings/unidlings is complete. I see in the customer's data: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 362789 root 20 0 479540 430148 2644 R 97.3 1.3 3900:26 iptables-restor That is... very weird. An iptables-restore command that has been running for 2.7 days, and is using 97.3% CPU. Something has gone really wrong there; it must be hitting some iptables-restore bug. I'm going to hand this bug over to the team that maintains iptables because I'm not sure exactly what sort of debugging info they'd need.
Hi! Sorry for the delay, I was out for two weeks. Do we know the exact iptables RPM version this is happening with? The (seemingly) busy-looping iptables-restore call is weird, indeed. I am not aware of any bugs causing this, though that doesn't mean there is none. Ideally we would use strace/gdb to check what the process is doing, though I fear the setup is no longer available in that state, right? If it's really that some situation causes kube-proxy to call 'iptables -C' too often, a good solution might be to cache the output of 'iptables -S' and search for lines matching the rule in question in there. Thanks, Phil
(In reply to Phil Sutter from comment #16) > Do we know the exact iptables RPM version this is happening with? RHCOS in 4.7 should have iptables-1.8.4-17.el8 > Ideally we would use strace/gdb to check what the process is doing, though I > fear the setup is no longer available in that state, right? it seems to happen reliably for them, so they should be able to get debugging > If it's really that some situation causes kube-proxy to call 'iptables -C' > too often no, that was a total red herring. It has nothing to do with that.
It's actually iptables-1.8.4-15.el8_3.3.x86_64 and yes, we could probably make iptables-restore go nuts again. Any suggestions for strace/gdb options that might help you? KR, Jochen
Hi Jochen, (In reply to Jochen Schneider from comment #20) > It's actually iptables-1.8.4-15.el8_3.3.x86_64 and yes, we could probably > make iptables-restore > go nuts again. > Any suggestions for strace/gdb options that might help you? 'strace -ffp <PID>' should suffice. Feel free to ping me via email in case there's a "kurzer Dienstweg". :) Cheers, Phil
Do testing with fix PR on build 4.10.0-0.ci-2021-09-07-215741 steps: 1.oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json 2. $ cat service.yaml { "apiVersion": "v1", "kind": "Service", "metadata": { "labels": { "name": "test-service" }, "name": "NAME" }, "spec": { "ports": [ { "name": "http", "port": 27017, "protocol": "TCP", "targetPort": 8080 } ], "selector": { "name": "test-pods" } } } 3. for i in {1..1000}; do sed s/NAME/test-$i/ < service.yaml | oc create -f - ;oc annotate endpoints test-$i "idling.alpha.openshift.io/unidle-targets"='[{"kind":"ReplicationController","name":"test-rc","replicas":2}]'; oc annotate endpoints test-$i "idling.alpha.openshift.io/idled-at"="2021-09-01T09:16:49Z"; done During svc is creating. the max CPU usage for openshift-sdn pod is 42% is better than old version ( almost 100% with same steps) @ danw Do you think the test result is ok? see the picture of cpu usage for openshift-sdn
Ok, Thanks Dan. Move this to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
Hi Team, One of our customer is able to see this alert on 4.10.16 FIRING | NodeProxyApplySlow | prd-rcdn-01 | prd-rcdn-01-hzgq7-default-router-1b-9gctk | openshift-sdn | RUNBOOK URL And I can see the node for which the alert is being fired the SDN pod takes times somewhere between 8-13 sec. $ oc logs sdn-cfzbl -c sdn|grep SyncProxy 2022-08-26T13:23:29.785867027Z I0826 13:23:29.785801 4151 proxier.go:806] "SyncProxyRules complete" elapsed="13.407328435s" 2022-08-26T13:24:43.571436898Z I0826 13:24:43.571381 4151 proxier.go:806] "SyncProxyRules complete" elapsed="8.378959381s" 2022-08-26T14:18:29.905021314Z I0826 14:18:29.904961 4151 proxier.go:806] "SyncProxyRules complete" elapsed="11.638542409s" 2022-08-26T14:19:26.684130098Z I0826 14:19:26.684103 4151 proxier.go:806] "SyncProxyRules complete" elapsed="12.098231796s" 2022-08-26T14:19:55.885757510Z I0826 14:19:55.885735 4151 proxier.go:806] "SyncProxyRules complete" elapsed="9.681342889s" 2022-08-26T14:20:06.072115013Z I0826 14:20:06.072060 4151 proxier.go:806] "SyncProxyRules complete" elapsed="10.185406851s" Which is way too long as compared to other SDN pod(s) $ oc logs sdn-znvqh -c sdn|grep SyncProxy 2022-08-26T14:21:56.235750182Z I0826 14:21:56.235654 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.745043238s" 2022-08-26T14:22:05.575696511Z I0826 14:22:05.575559 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.659197681s" 2022-08-26T14:22:10.461810522Z I0826 14:22:10.461730 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.719790939s" 2022-08-26T14:22:13.241850132Z I0826 14:22:13.241791 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.779103856s" 2022-08-26T14:22:15.989606272Z I0826 14:22:15.989533 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.746797095s" 2022-08-26T14:22:18.779242498Z I0826 14:22:18.779141 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.78870992s" 2022-08-26T14:22:43.244129256Z I0826 14:22:43.244059 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.898532213s" 2022-08-26T14:23:02.251277259Z I0826 14:23:02.251002 2910 proxier.go:806] "SyncProxyRules complete" elapsed="3.565183211s" 2022-08-26T14:23:05.516604601Z I0826 14:23:05.516527 2910 proxier.go:806] "SyncProxyRules complete" elapsed="2.816440811s" Request you to please have a look. Please let me know if we need to raise a new Bug for this. Looking forward for a response.
> Please let me know if we need to raise a new Bug for this. This bug is already CLOSED, so yes, you need a new bug, if you determine that the customer has a bug to be fixed.