Bug 1879607

Summary: Improve the poorly designed iptables setup
Product: OpenShift Container Platform
Reporter: Phil Sutter <psutter>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED UPSTREAM
Severity: high
Priority: high
CC: aconstan, bbennett, danw, dcbw, dyocum, rkhan, surya
Version: 4.4
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-01-04 15:17:36 UTC
Type: Bug
Bug Blocks: 1892110

Description Phil Sutter 2020-09-16 15:58:26 UTC
This may be the wrong (sub-)component; please correct it if so - I have no idea where it belongs.

When dealing with bug 1873127, I noticed how bad the iptables ruleset becomes with many routes (assuming that's the correct term). Maybe this is not even a case of a problematic setup but just rules not being cleaned up or something like that. Anyway, here's what I noted when having a close look at the firewall ruleset in place on the customer's ingress controller:

The largest chains are KUBE-SERVICES in nat (33k rules) and KUBE-SERVICES in filter
(4k rules). The latter is guarded by a check for conntrack state being NEW; in
nat that's implicit, so existing connections should be OK. (haproxy health
checks sadly are not.)

In nat, KUBE-SERVICES looks like this:

| Chain KUBE-SERVICES (2 references)
|  pkts bytes target     prot opt in     out     source               destination
|     0     0 KUBE-MARK-MASQ  tcp  --  *      *      !172.20.0.0/16        172.30.78.63         /* da-saas-non-prod-f91b3f3bf7894b05b8965d5d243cb81c/bps-facade-svc:platform-http cluster IP */ tcp dpt:8081
|     0     0 KUBE-SVC-KN5EB6ZOVNZ4KWTQ  tcp  --  *      *       0.0.0.0/0            172.30.78.63         /* da-saas-non-prod-f91b3f3bf7894b05b8965d5d243cb81c/bps-facade-svc:platform-http cluster IP */ tcp dpt:8081
|     0     0 KUBE-MARK-MASQ  tcp  --  *      *      !172.20.0.0/16        172.30.68.103        /* dl-id-os-jl/glusterfs-dynamic-postgresql-4: cluster IP */ tcp dpt:1
|     0     0 KUBE-SVC-SRRROCQS5P62W363  tcp  --  *      *       0.0.0.0/0            172.30.68.103        /* dl-id-os-jl/glusterfs-dynamic-postgresql-4: cluster IP */ tcp dpt:1

So basically, for each destination IP:port pair, there is a check whether the
packet should be marked and a jump towards DNAT. Probably this could be
simplified a lot using ipset.
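
To illustrate the idea (untested sketch, the set name is made up): the per-service
KUBE-MARK-MASQ rules could collapse into a single rule matching against a set of
all cluster IP:port pairs, e.g.:

| ipset create KUBE-CLUSTER-IP hash:ip,port
| ipset add KUBE-CLUSTER-IP 172.30.78.63,tcp:8081
| iptables -t nat -A KUBE-SERVICES ! -s 172.20.0.0/16 -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ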

The chain is referenced from both PREROUTING and OUTPUT, which means in a setup
where OVS does the routing the same packet hits this chain twice.
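
For reference, the jumps in question look roughly like this (comments shortened):

| -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
| -A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES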

In filter, KUBE-SERVICES looks like this:

| Chain KUBE-SERVICES (3 references)
|  pkts bytes target     prot opt in     out     source               destination
|     0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.59.240        /* da-na-excalibur-dev/kie-test2:8080-tcp has no endpoints */ tcp dpt:8080 reject-with icmp-port-unreachable
|     0     0 REJECT     tcp  --  *      *       0.0.0.0/0            172.30.22.23         /* cismodels/grpc-server:50051-tcp has no endpoints */ tcp dpt:50051 reject-with icmp-port-unreachable

So a reject for each destination IP:port pair. If ipset is not an option, you
could still build a tree by matching subnets and thereby reduce the number of
rules a packet has to traverse.

This chain is referenced from both INPUT and OUTPUT, so same problem on a
system using OVS for routing. With haproxy running in a dedicated container, an
incoming SYN for an application faces this route:

- nat PREROUTING -> KUBE-SERVICES -> 33k rules
- filter INPUT -> KUBE-SERVICES -> 4k rules
- OVS
- filter OUTPUT -> KUBE-SERVICES -> 4k rules
- nat OUTPUT -> KUBE-SERVICES -> 33k rules
- haproxy
- nat PREROUTING -> KUBE-SERVICES -> 33k rules
- filter INPUT -> KUBE-SERVICES -> 4k rules
- OVS
- filter OUTPUT -> KUBE-SERVICES -> 4k rules
- nat OUTPUT -> KUBE-SERVICES -> 33k rules

Customer's setup is obviously huge, but every rule counts four times with this
design. Isn't it possible to match on interface or something to reduce this?

Oddly enough, according to packet counters 212M packets hit KUBE-SERVICES from
nat PREROUTING and 211M packets from nat OUTPUT. Yet only two rules in
KUBE-SERVICES itself have non-zero counters, one of them being the final
KUBE-NODEPORTS jump. Same for KUBE-SERVICES in filter: 280M packets hit it from
filter INPUT, 447M packets from filter OUTPUT. And not a single rule with
non-zero counters in there. Either this is a case of partial counter restore
and the node has rebooted before (well possible), or all this fuss is not even
about productive traffic.

Comment 1 Dan Winship 2020-09-17 15:44:32 UTC
So do you think something needs to be fixed in order to solve the customer problem in 1873127, or are you just offering suggestions on ways to improve the code? If the latter, we're mostly not bothering to try to improve it further, because the long term plan is to move to ovn-kubernetes...

(In reply to Phil Sutter from comment #0)
> So basically a check whether a packet should be marked and a call to DNAT for
> each destination IP:port pair. Probably this could be simplified a lot using
> ipset.

Each service has different IPs to be checking for / NATting to though...

> So a reject for each destination IP:port pair. If ipset is not an option, you
> could still build a tree by matching subnets and thereby reduce the number of
> rules a packet has to traverse.

Service IPs are assigned at random out of the service subnet (a /16), so you're unlikely to ever be able to "match subnets". We could probably use ipset, but this table is only used for services which are defined but which (currently) have no pods backing them, so it is not expected to be _that_ large.

> Customer's setup is obviously huge, but every rule counts four times with this
> design. Isn't it possible to match on interface or something to reduce this?

Oh, probably. We have made changes to the rules before when we observed performance problems (eg, the check for connstate being NEW was added to fix massive performance problems seen on OpenShift Online). But at this point the plan is to migrate everyone to ovn-kubernetes eventually, so the iptables-based kube-proxy will go away.

> non-zero counters in there. Either this is a case of partial counter restore
> and the node has rebooted before (well possible), or all this fuss is not even
> about productive traffic.

Yeah, rules frequently get regenerated. You can't trust the counters.

Comment 2 Phil Sutter 2020-09-18 08:35:33 UTC
Hi Dan,

(In reply to Dan Winship from comment #1)
> So do you think something needs to be fixed in order to solve the customer
> problem in 1873127, or are you just offering suggestions on ways to improve
> the code? If the latter, we're mostly not bothering to try to improve it
> further, because the long term plan is to move to ovn-kubernetes...

It's the latter, unless something has considerably changed since 4.3 (or maybe
3.11), as that (allegedly) has been working fine. Currently the customer has 15k
routes and "everything" goes up in smoke. They claim that with OCP 3.11 they even
had 27k and things were running smoothly.

I expected that kind of reply already, knowing that OVN is considered the
saviour for the iptables-legacy/nft dilemma as well. Being rather unfamiliar
with OVS setup, I wonder if there are ways in there to avoid long lists of
rules as seen with iptables.

> (In reply to Phil Sutter from comment #0)
> > So basically a check whether a packet should be marked and a call to DNAT for
> > each destination IP:port pair. Probably this could be simplified a lot using
> > ipset.
> 
> Each service has different IPs to be checking for / NATting to though...

It would work for the first conditional jump to KUBE-MARK-MASQ. With nftables
one could use a map to turn the second KUBE-SVC-FOO jump into an O(1)
operation, too.
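
With nftables, something along these lines should work (untested sketch; the map
name is made up and the rules populating the KUBE-SVC-* chains are omitted):

| nft add map ip nat kube-svc-map '{ type ipv4_addr . inet_service : verdict; }'
| nft add element ip nat kube-svc-map '{ 172.30.78.63 . 8081 : jump KUBE-SVC-KN5EB6ZOVNZ4KWTQ }'
| nft add rule ip nat KUBE-SERVICES ip daddr . tcp dport vmap @kube-svc-map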

> > So a reject for each destination IP:port pair. If ipset is not an option, you
> > could still build a tree by matching subnets and thereby reduce the number of
> > rules a packet has to traverse.
> 
> Service IPs are assigned at random out of the service subnet (a /16), so
> you're unlikely to ever be able to "match subnets". We could probably use ipset,
> but this table is only used for services which are defined but which
> (currently) have no pods backing them, so it is not expected to be _that_ large.

The tree idea goes like this:

| -A KUBE-SERVICES -d 172.30.0.0/18 -j KUBE-SERVICES-0
| -A KUBE-SERVICES -d 172.30.64.0/18 -j KUBE-SERVICES-64
| -A KUBE-SERVICES -d 172.30.128.0/18 -j KUBE-SERVICES-128
| -A KUBE-SERVICES -d 172.30.192.0/18 -j KUBE-SERVICES-192
| -A KUBE-SERVICES-0   -d 172.30.43.174 -p tcp --dport 27017 -j REJECT
| -A KUBE-SERVICES-128 -d 172.30.187.66 -p tcp --dport 5000 -j REJECT
| -A KUBE-SERVICES-0 -d 172.30.44.39 -p tcp --dport 8080 -j REJECT

The four extra chains and rules quarter the number of rules each packet has to
traverse - assuming perfect randomness, of course. :)

Since this chain has grown to 4k rules, would you say the customer is mis-using
the product or has a configuration issue or something like that?

> > Customer's setup is obviously huge, but every rule counts four times with this
> > design. Isn't it possible to match on interface or something to reduce this?
> 
> Oh, probably. We have made changes to the rules before when we observed
> performance problems (eg, the check for connstate being NEW was added to fix
> massive performance problems seen on OpenShift Online). But at this point
> the plan is to migrate everyone to ovn-kubernetes eventually, so the
> iptables-based kube-proxy will go away.

What could be the reason for customers sticking with iptables? Just for
historical reasons? In this concrete case, they migrated from 3.11 to 4.X - I
guess a good occasion to give OVN a try as well, no?

> > non-zero counters in there. Either this is a case of partial counter restore
> > and the node has rebooted before (well possible), or all this fuss is not even
> > about productive traffic.
> 
> Yeah, rules frequently get regenerated. You can't trust the counters.

OK, thanks for clarifying!

Comment 3 Dan Winship 2020-09-18 12:40:20 UTC
(In reply to Phil Sutter from comment #2)
> It's the latter, unless something has considerably changed since 4.3 (or maybe
> 3.11), as that (allegedly) has been working fine. Currently the customer has 15k
> routes and "everything" goes up in smoke. They claim that with OCP 3.11 they even
> had 27k and things were running smoothly.

Occasionally things regress and then we tweak things. The "check for conntrack state being NEW" was added after a rearrangement of rules a few releases back that had horrible effects on OpenShift Online...

> The tree idea goes like this:
> 
> | -A KUBE-SERVICES -d 172.30.0.0/18 -j KUBE-SERVICES-0
> | -A KUBE-SERVICES -d 172.30.64.0/18 -j KUBE-SERVICES-64

ah, I see. Yes. We'd thought about wrapping the whole thing in `-d 172.30.0.0/16`, but there are some rules in the chain that aren't actually in the service CIDR so you can't do that. And then we didn't follow up because things were _mostly_ not totally falling apart with the current rules...

> Since this chain has grown to 4k rules, would you say the customer is
> mis-using the product or has a configuration issue or something like that?

Not necessarily... With that many rules, presumably there are a lot of people working on the cluster, so if you have one user leaving a stray service behind here and another user leaving a few stray services behind over there, it starts to add up after a while even though no one person is doing anything especially bad.

> What could be the reason for customers sticking with iptables? Just for
> historical reasons? In this concrete case, they migrated from 3.11 to 4.X - I
> guess a good occasion to give OVN a try as well, no?

ovn-kubernetes is still not yet GA

Comment 4 Phil Sutter 2020-09-18 15:00:30 UTC
(In reply to Dan Winship from comment #3)
> (In reply to Phil Sutter from comment #2)
> > It's the latter, unless something has considerably changed since 4.3 (or maybe
> > 3.11), as that (allegedly) has been working fine. Currently the customer has 15k
> > routes and "everything" goes up in smoke. They claim that with OCP 3.11 they even
> > had 27k and things were running smoothly.
> 
> Occasionally things regress and then we tweak things. The "check for
> conntrack state being NEW" was added after a rearrangement of rules a few
> releases back that had horrible effects on OpenShift Online...

So if a change in component A causes a regression, you start searching for
tweaks in the others to make up for it?

> > The tree idea goes like this:
> > 
> > | -A KUBE-SERVICES -d 172.30.0.0/18 -j KUBE-SERVICES-0
> > | -A KUBE-SERVICES -d 172.30.64.0/18 -j KUBE-SERVICES-64
> 
> ah, I see. Yes. We'd thought about wrapping the whole thing in `-d
> 172.30.0.0/16`, but there are some rules in the chain that aren't actually
> in the service CIDR so you can't do that. And then we didn't follow up
> because things were _mostly_ not totally falling apart with the current
> rules...

What would that '-d 172.30.0.0/16' wrap improve? Assuming all packets you see
in there are destined for that subnet, you're just adding an extra match which
evaluates true in all cases, no?

> > Since this chain has grown to 4k rules, would you say the customer is
> > mis-using the product or has a configuration issue or something like that?
> 
> Not necessarily... With that many rules, presumably there are a lot of
> people working on the cluster, so if you have one user leaving a stray
> service behind here and another user leaving a few stray services behind
> over there, it starts to add up after a while even though no one person is
> doing anything especially bad.

OK, so why are you not expecting it to grow "_that_ large" again? :)

> > What could be the reason for customers sticking with iptables? Just for
> > historical reasons? In this concrete case, they migrated from 3.11 to 4.X - I
> > guess a good occasion to give OVN a try as well, no?
> 
> ovn-kubernetes is still not yet GA

When will that happen? And are customers expected to migrate ASAP or will older
iptables-based setups still be supported?

Comment 5 Dan Winship 2020-09-18 16:46:17 UTC
(In reply to Phil Sutter from comment #4)
> (In reply to Dan Winship from comment #3)
> > Occasionally things regress and then we tweak things. The "check for
> > conntrack state being NEW" was added after a rearrangement of rules a few
> > releases back that had horrible effects on OpenShift Online...
> 
> So if a change in component A causes a regression, you start searching for
> tweaks in the others to make up for it?

No, someone made a change to how kube-proxy organized the iptables rules (to fix another bug), and as a side effect of that reorganization, OpenShift Online performance cratered, which we were able to fix by adding the "conntrack state is NEW" check.

> > > The tree idea goes like this:
> > > 
> > > | -A KUBE-SERVICES -d 172.30.0.0/18 -j KUBE-SERVICES-0
> > > | -A KUBE-SERVICES -d 172.30.64.0/18 -j KUBE-SERVICES-64
> > 
> > ah, I see. Yes. We'd thought about wrapping the whole thing in `-d
> > 172.30.0.0/16`, but there are some rules in the chain that aren't actually
> > in the service CIDR so you can't do that. And then we didn't follow up
> > because things were _mostly_ not totally falling apart with the current
> > rules...
> 
> What would that '-d 172.30.0.0/16' wrap improve? Assuming all packets you see
> in there are destined for that subnet

They aren't; _all_ packets hit those chains currently, not just all packets destined for services.

But I remembered why we didn't add that check: because kube-proxy doesn't actually know the "172.30.0.0/16" CIDR. It sees individual service IPs, but it never knows for sure the CIDR that they're allocated out of...

(Though I guess we could just compute the smallest CIDR containing every IP it has seen.)

> > > Since this chain has grown to 4k rules, would you say the customer is
> > > mis-using the product or has a configuration issue or something like that?
> > 
> > Not necessarily... With that many rules, presumably there are a lot of
> > people working on the cluster, so if you have one user leaving a stray
> > service behind here and another user leaving a few stray services behind
> > over there, it starts to add up after a while even though no one person is
> > doing anything especially bad.
> 
> OK, so why are you not expecting it to grow "_that_ large" again? :)

Well, I mean it's easy to explain how you can have 4k out of 33k services be "unused" services by not paying attention. But you couldn't plausibly have 30k out of 33k services be unused, and you couldn't have 33k out of 275k services be unused because the system would have fallen over before you got to 275k services.

*shrug*

(And in case it wasn't clear, we can definitely work on optimizing the iptables rules some more if it will fix some actual customer problem. But if your customer's problem was somewhere else, then we have lots of other things to be working on instead...)

> > ovn-kubernetes is still not yet GA
> 
> > When will that happen? And are customers expected to migrate ASAP or will
> > older iptables-based setups still be supported?

It will be GA in 4.6 which will be released this fall, but it won't yet be the default for new installs and customers won't be actively encouraged to migrate yet. We'll be supporting openshift-sdn for a long time, but for increasingly smaller values of "supporting" over time...

Comment 6 Phil Sutter 2020-09-25 15:46:23 UTC
(In reply to Dan Winship from comment #5)
> (And in case it wasn't clear, we can definitely work on optimizing the
> iptables rules some more if it will fix some actual customer problem. But if
> your customer's problem was somewhere else, then we have lots of other
> things to be working on instead...)

The number of iptables rules each haproxy health check has to traverse is a
relevant factor. It is still unclear why things got so much worse since OCP
4.3; maybe they were already at the system's limit and the reduced reload
performance of haproxy did the trick. Anyway, they claim that OCP 3.11 was
stable with 27k routes instead of the 15k which kill them now. Who knows what
changed, maybe the reject rules fixing those CLOSE_WAIT sockets were added in
the meantime.

> > > ovn-kubernetes is still not yet GA
> > 
> > When will that happen? And are customers expected to migrate ASAP or will
> > older
> > iptables-based setups still be supported?
> 
> It will be GA in 4.6 which will be released this fall, but it won't yet be
> the default for new installs and customers won't be actively encouraged to
> migrate yet. We'll be supporting openshift-sdn for a long time, but for
> increasingly smaller values of "supporting" over time...

Doesn't sound like a good option to point at for customers who are having
problems now, but maybe that's just my perspective.

Either way, back to the ruleset itself:

nat KUBE-SERVICES is full of rules like:

-A KUBE-SERVICES ! -s 172.20.0.0/16 -d IP_A -p tcp --dport PORT_A -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d IP_A -p tcp --dport PORT_A -j KUBE-SVC-FOO

One could move the first rule into KUBE-SVC-FOO and simplify it:

-I KUBE-SVC-FOO 1 ! -s 172.20.0.0/16 -j KUBE-MARK-MASQ

This cuts nat KUBE-SERVICES in half. Same with KUBE-NODEPORTS, BTW: moving the
KUBE-MARK-MASQ jump into the KUBE-SVC-FOO chain halves all non-XLB-type
service rules.

In filter, KUBE-SERVICES contains only conditional rejects. Are you sure it
makes sense to jump to it from OUTPUT chain?

What about this KUBE-EXTERNAL-SERVICES, BTW? Your commit message states: "Split
out a KUBE-EXTERNAL-SERVICES chain so we don't have to run KUBE-SERVICES from
INPUT" - but with commit de25d6cb9577b ("Kube-proxy: REJECT LB IPs with no
endpoints"), KUBE-SERVICES was added again to INPUT while KUBE-EXTERNAL-SERVICES
was left in place. Assuming that makes sense, one could drop the addrtype check
from the rules added to it, as probably every packet hitting the INPUT chain
should have a destination address categorized as LOCAL by the kernel. If in
doubt, that match could be added to the rule jumping to it from INPUT so it
happens once and not for every rule within.
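
I.e. roughly this (sketch only, on top of whatever matches the existing jump rule
already carries):

-A INPUT -m addrtype --dst-type LOCAL -j KUBE-EXTERNAL-SERVICES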

Comment 8 Dan Winship 2020-09-28 14:58:06 UTC
(In reply to Phil Sutter from comment #6)
> Doesn't sound like a good option to point at for customers who are having
> problems now, but maybe that's just my perspective.

No, as I said, if customers are having problems now, we'll fix them. We're just not working on refactoring things to make them better where there _aren't_ customer issues.

> It is still unclear why things got so much worse since OCP 4.3

...

> What about this KUBE-EXTERNAL-SERVICES, BTW? Your commit message states: "Split
> out a KUBE-EXTERNAL-SERVICES chain so we don't have to run KUBE-SERVICES from
> INPUT" - but with commit de25d6cb9577b ("Kube-proxy: REJECT LB IPs with no
> endpoints"), KUBE-SERVICES was added again to INPUT

Ugh. That's most likely the culprit. (INPUT seems to be a lot more sensitive to having too many rules than other chains.)

Comment 9 Phil Sutter 2020-09-28 15:20:13 UTC
Hi Dan,

(In reply to Dan Winship from comment #8)
> (In reply to Phil Sutter from comment #6)
> > Doesn't sound like a good option to point at for customers who are having
> > problems now, but maybe that's just my perspective.
> 
> No, as I said, if customers are having problems now, we'll fix them. We're
> just not working on refactoring things to make them better where there
> _aren't_ customer issues.

OK, maybe I misunderstood.

> > It is still unclear why things got so much worse since OCP 4.3
> 
> ...
> 
> > What about this KUBE-EXTERNAL-SERVICES, BTW? Your commit message states: "Split
> > out a KUBE-EXTERNAL-SERVICES chain so we don't have to run KUBE-SERVICES from
> > INPUT" - but with commit de25d6cb9577b ("Kube-proxy: REJECT LB IPs with no
> > endpoints"), KUBE-SERVICES was added again to INPUT
> 
> Ugh. That's most likely the culprit. (INPUT seems to be a lot more sensitive
> to having too many rules than other chains.)

How can I find out which Kubernetes version a given OCP version is based upon?
Are there patches on top?

Thanks, Phil

Comment 10 Dan Winship 2020-09-28 15:35:03 UTC
OCP 3.11 = kube 1.11
OCP 4.1  = kube 1.13
OCP 4.2  = kube 1.14
OCP 4.3  = kube 1.16
OCP 4.4  = kube 1.17

Specifically, openshift-sdn in 4.2 and later uses the kube-proxy code from the sdn-X.Y-kubernetes-X.Y.Z branches of https://github.com/openshift/kubernetes.

sdn-4.2-kubernetes-1.14.0 includes de25d6cb9577b, so if that's the problem it should have started in 4.2.

Comment 11 Dan Williams 2020-09-29 20:15:29 UTC
Over to Surya... Surya, can you coordinate with Dan Winship on what needs to be done here? Thanks!

Comment 12 Dan Williams 2020-09-29 20:16:58 UTC
Comment 8 is most relevant here (at least that's what we currently think the problem is).

Comment 13 Surya Seetharaman 2020-09-30 09:38:50 UTC
Just went through the bug:

From what I could understand, we had a bug (https://github.com/kubernetes/kubernetes/issues/43212), then we did a bunch of fixes, mainly adding reject rules to the INPUT and OUTPUT chains to drop traffic [nodeport/externalip/svcport] aimed at services with no endpoints/backing pods (https://github.com/kubernetes/kubernetes/pull/43972). This caused perf issues and a fix was attempted (https://github.com/kubernetes/kubernetes/pull/56164): in order to avoid matching packets against the reject rules meant for local (generated and destined) traffic in both INPUT and OUTPUT, the local versus external services chains were created (KUBE-SERVICES (cluster-ip) vs. KUBE-EXTERNAL-SERVICES (nodeport/external-ip)) and matching was continued only for the OUTPUT chain. Then we did https://github.com/kubernetes/kubernetes/pull/74394/commits/de25d6cb9577b00b24ca2baf0187b34f483a0d6c which reinstated the rules in the INPUT chain again, rendering the addition of kubeExternalServicesChain useless.

While I don't exactly possess the background history to completely understand everything in 4.2/4.3, I guess at least adding the option to filter based on interface instead of ip:port could help. Though my idea to add NewDetectLocalByInterface (which is super easy since it just replaces the -s with -i) is not exactly related to this bug, I am more than happy to help out with this. I'll talk to Dan to find a way forward to reduce the complexity.

Comment 14 Surya Seetharaman 2020-10-01 13:59:36 UTC
Update: I am working on an upstream patch (specifically on the testing portion) to fix the filter table entries, move the lb reject rules to the kube-external-services chain similar to what Dan did in #56164, and refrain from calling kube-services from within the input chain. I don't know for sure whether doing that alone would bring about a noticeable perf change, but it's worth a try.

Once that is done, we can see if we can also reduce the nat table entries.

Comment 15 Phil Sutter 2020-10-02 11:55:40 UTC
I would start with the nat table reorg as that's a small change which actually
halves the number of rules therein and in the referenced situation nat's
KUBE-SERVICES is ten times larger than filter's, but that's up to you.

A few other things to consider:

* Does it make sense to do the dnat and masquerade marking in both PREROUTING
  and OUTPUT? Are there outgoing connections that need dnat at all?

* Maybe the large number of rules in nat KUBE-SERVICES is even a bug in
  kube-proxy, maybe some missing cleanup. In the customer case, the haproxy config
  has 14.6k backends defined. Assuming two rules per backend, that's close
  to the 33k rules in nat KUBE-SERVICES, yet getting rid of the 4k extra rules
  could help as well.

* Do haproxy health checks need to pass through the ruleset at all? I could
  imagine it's completely sufficient to just masquerade any connection it
  opens. So maybe a few extra rules to identify and early-accept them would
  suffice to eliminate the overhead its health checks cause.

Comment 16 Surya Seetharaman 2020-10-02 13:51:43 UTC
@Phil, agreed the nat table is huge, but I'd need to dig into that and see whether all the rules present there have a reason to be there or not.

Also thanks for the suggestions, I'll look into them. This will take some time since I am fairly new to this, so I hope this bug does not have too high a priority (just asking so that I can devote time accordingly).

Here is the PR for avoiding the call of KUBE-SERVICES from INPUT for the filter table: https://github.com/kubernetes/kubernetes/pull/95252

Comment 17 Surya Seetharaman 2020-10-14 17:49:55 UTC
I have been looking into improving the NAT table rules. At first it was frustrating because, like Phil mentioned, there are too many jumps for a single service (also k8s has too many service types -_-). For example, a single ClusterIP service has 6 rules created in the NAT table to perform a DNAT, in addition to other things:

-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.65.184/32 -p tcp -m comment --comment "default/example-service cluster IP" -m tcp --dport 8080 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.65.184/32 -p tcp -m comment --comment "default/example-service cluster IP" -m tcp --dport 8080 -j KUBE-SVC-QLK4GP7H447Z7HSM
-A KUBE-SVC-QLK4GP7H447Z7HSM -m comment --comment "default/example-service" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-53M442I75LOY2PNU
-A KUBE-SVC-QLK4GP7H447Z7HSM -m comment --comment "default/example-service" -j KUBE-SEP-UGQSD22WULK7O3L5
-A KUBE-SEP-UGQSD22WULK7O3L5 -s 10.244.2.4/32 -m comment --comment "default/example-service" -j KUBE-MARK-MASQ
-A KUBE-SEP-UGQSD22WULK7O3L5 -p tcp -m comment --comment "default/example-service" -m tcp -j DNAT --to-destination 10.244.2.4:8080


While I was trying to work on removing a few jumps, I was presented with this: https://docs.google.com/drawings/d/1MtWL8qRTs6PlnJrW4dh8135_S9e2SaawT410bJuoBPk/edit as a justification for the design of the NAT table in kube-proxy.


> (In reply to Phil Sutter from comment #6)

> 
> Either way, back to the ruleset itself:
> 
> nat KUBE-SERVICES is full of rules like:
> 
> > -A KUBE-SERVICES ! -s 172.20.0.0/16 -d IP_A -p tcp --dport PORT_A -j
> > KUBE-MARK-MASQ
> > -A KUBE-SERVICES -d IP_A -p tcp --dport PORT_A -j KUBE-SVC-FOO
> > 
> > One could move the first rule into KUBE-SVC-FOO and simplify it:
> > 
> > -I KUBE-SVC-FOO 1 ! -s 172.20.0.0/16 -j KUBE-MARK-MASQ
> > 
> > This cuts nat KUBE-SERVICES in half. Same with KUBE-NODEPORTS, BTW: moving the
> > KUBE-MARK-MASQ jump into the KUBE-SVC-FOO chain halves all non-XLB-type
> > service rules.
> 


If we tried to do what you suggest, then I'd have to call each and every KUBE-SVC-hash chain from OUTPUT/PREROUTING individually, which I don't think is possible since the chains are added dynamically. So KUBE-SERVICES is the thing that ties this whole service logic together. Plus I don't think it is possible to do this for services other than ClusterIP.

Anyway, my biggest concern is that each jump seems to have been created to solve a specific problem. The code hasn't been touched in years, and if I now do a major change like refactoring it, I am not really sure what I would break.

I can continue to look into the flowchart and see if there is a jump that can be skipped to reduce the rules, but no promises.

Comment 18 Dan Winship 2020-10-28 13:56:02 UTC
(moving back to ASSIGNED since this still needs to be vendored into origin)

Comment 19 Phil Sutter 2020-10-29 13:51:49 UTC
Hi,

(In reply to Surya Seetharaman from comment #17)
> I have been looking into improving NAT table rules: At first it was
> frustrating because like Phil mentioned there are too many jumps for a
> single service (also k8s has too many service types -_-). For example a
> single ClusterIP service would have 6 rules created in NAT table to perform
> a DNATing in addition to other things:

Glad you found time to look into the ruleset.

> -A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.65.184/32 -p tcp -m comment --comment "default/example-service cluster IP" -m tcp --dport 8080 -j KUBE-MARK-MASQ
> -A KUBE-SERVICES -d 10.96.65.184/32 -p tcp -m comment --comment "default/example-service cluster IP" -m tcp --dport 8080 -j KUBE-SVC-QLK4GP7H447Z7HSM
> -A KUBE-SVC-QLK4GP7H447Z7HSM -m comment --comment "default/example-service" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-53M442I75LOY2PNU
> -A KUBE-SVC-QLK4GP7H447Z7HSM -m comment --comment "default/example-service" -j KUBE-SEP-UGQSD22WULK7O3L5
> -A KUBE-SEP-UGQSD22WULK7O3L5 -s 10.244.2.4/32 -m comment --comment "default/example-service" -j KUBE-MARK-MASQ
> -A KUBE-SEP-UGQSD22WULK7O3L5 -p tcp -m comment --comment "default/example-service" -m tcp -j DNAT --to-destination 10.244.2.4:8080

Most of those are not critical. A packet not destined for 10.96.65.184 will only
see the first two rules, as KUBE-SERVICES is traversed unconditionally.

> While I was trying to work on removing a few jumps, I was presented with this:
> https://docs.google.com/drawings/d/1MtWL8qRTs6PlnJrW4dh8135_S9e2SaawT410bJuoBPk/edit
> as a justification for the design of the NAT table in kube-proxy.

Interesting picture, thanks for the link.

> > (In reply to Phil Sutter from comment #6)
> 
> > 
> > Either way, back to the ruleset itself:
> > 
> > nat KUBE-SERVICES is full of rules like:
> > 
> > -A KUBE-SERVICES ! -s 172.20.0.0/16 -d IP_A -p tcp --dport PORT_A -j
> > KUBE-MARK-MASQ
> > -A KUBE-SERVICES -d IP_A -p tcp --dport PORT_A -j KUBE-SVC-FOO
> > 
> > One could move the first rule into KUBE-SVC-FOO and simplify it:
> > 
> > -I KUBE-SVC-FOO 1 ! -s 172.20.0.0/16 -j KUBE-MARK-MASQ
> > 
> > This cuts nat KUBE-SERVICES in half. Same with KUBE-NODEPORTS, BTW: moving the
> > KUBE-MARK-MASQ jump into the KUBE-SVC-FOO chain halves all non-XLB-type
> > service rules.
> > 
> 
> 
> If we tried to do what you suggest, then I'd have to call each and every
> KUBE-SVC-hash from the OUTPUT/PREROUTING individually which I don't think is
> possible since the chains are added dynamically. So KUBE-SERVICES is the
> thing that is tying this whole service logic together. Plus I don't think it
> is possible to do this for the services other than clusterIP.

I don't get why you would have to change the OUTPUT/PREROUTING contents. Can you
please elaborate on that?

> Anyways my biggest concern is that each jump seems to have been created to
> solve a specific problem. The code hasn't been touched in years and now if I
> do a major change like refactoring it I am not really sure what I would
> break.

That sounds like the project has grown so much it is not maintainable anymore.

> I can continue to look into the flowchart and see if there is a jump that
> can be skipped to reduce the rules, but no promises.

Please don't get hung up on that flow chart, it does not reflect reality. E.g. it
combines different setups into a single picture which do not occur together in
practice. This makes it seem much more complicated than it actually is. The way
I read it, the "Match a svc ..." conditionals are mutually exclusive. Drop all
but one and the chart looks much simpler.

Comment 20 Surya Seetharaman 2020-11-25 11:18:09 UTC
Update: apologies for the delayed response, I completely missed the needinfo, but I am working on getting the merged upstream filter table improvement into sdn. The rebase is ready and good to go.

Regarding the NAT table improvements:

(In reply to Phil Sutter from comment #19)
>
> > 
> > If we tried to do what you suggest, then I'd have to call each and every
> > KUBE-SVC-hash from the OUTPUT/PREROUTING individually which I don't think is
> > possible since the chains are added dynamically. So KUBE-SERVICES is the
> > thing that is tying this whole service logic together. Plus I don't think it
> > is possible to do this for the services other than clusterIP.
> 
> I don't get why you would have to change the OUTPUT/PREROUTING contents. Can you
> please elaborate on that?


Sorry, I guess I wasn't clear on this. What I was trying to get at was that each of the KUBE-SVC chains gets created on the fly, with a random hash, when the svc gets created.

> > -A KUBE-SERVICES ! -s 172.20.0.0/16 -d IP_A -p tcp --dport PORT_A -j
> > KUBE-MARK-MASQ
> > -A KUBE-SERVICES -d IP_A -p tcp --dport PORT_A -j KUBE-SVC-FOO
> > 
> > One could move the first rule into KUBE-SVC-FOO and simplify it:
> > 
> > -I KUBE-SVC-FOO 1 ! -s 172.20.0.0/16 -j KUBE-MARK-MASQ

So if we do the above, we will skip the KUBE-SERVICES chain and end up creating individual independent chains per service. Right now these SVC chains are tied together in a tree structure with the KUBE-SERVICES chain at the root. We control the versioning of addition and deletion of chains/rules and what gets called from what table in this manner:

https://github.com/kubernetes/kubernetes/blob/c1f36fa6f28d3618c03b65799bc3f58007624e5f/pkg/proxy/iptables/proxier.go#L395-L416

It's the KUBE-SERVICES chain that gets globally called from output and prerouting (https://github.com/kubernetes/kubernetes/blob/c1f36fa6f28d3618c03b65799bc3f58007624e5f/pkg/proxy/iptables/proxier.go#L401-L402) and then that triggers the KUBE-SVC-FOO chain in the flow.

> > -A KUBE-SERVICES ! -s 172.20.0.0/16 -d IP_A -p tcp --dport PORT_A -j
> > KUBE-MARK-MASQ
> > -A KUBE-SERVICES -d IP_A -p tcp --dport PORT_A -j KUBE-SVC-FOO

That is why they do it in two steps, which I understand increases complexity; however, because of the way it is designed, if we remove the link between the KUBE-SERVICES and KUBE-SVC-FOO chains, i.e. the jump from KUBE-SERVICES to the respective KUBE-SVC-FOO, we'd have to call each of these FOO chains from the top, i.e. https://github.com/kubernetes/kubernetes/blob/c1f36fa6f28d3618c03b65799bc3f58007624e5f/pkg/proxy/iptables/proxier.go#L395, which would not be possible to track.

Comment 21 Phil Sutter 2020-11-25 12:32:13 UTC
Surya,

I didn't suggest removing the KUBE-SERVICES chain, merely reorganizing the rules therein. Could you please give this patch a try:

--- a/pkg/proxy/iptables/proxier.go
+++ b/pkg/proxy/iptables/proxier.go
@@ -1054,13 +1054,15 @@ func (proxier *Proxier) syncProxyRules() {
 
                // Capture the clusterIP.
                if hasEndpoints {
-                       args = append(args[:0],
+                       writeLine(proxier.natRules,
                                "-A", string(kubeServicesChain),
                                "-m", "comment", "--comment", fmt.Sprintf(`"%s cluster IP"`, svcNameString),
                                "-m", protocol, "-p", protocol,
                                "-d", utilproxy.ToCIDR(svcInfo.ClusterIP()),
                                "--dport", strconv.Itoa(svcInfo.Port()),
+                               "-j", string(svcChain),
                        )
+                       args = append(args[:0], "-A", string(svcChain))
                        if proxier.masqueradeAll {
                                writeLine(proxier.natRules, append(args, "-j", string(KubeMarkMasqChain))...)
                        } else if proxier.localDetector.IsImplemented() {
@@ -1071,7 +1073,6 @@ func (proxier *Proxier) syncProxyRules() {
                                // If/when we support "Local" policy for VIPs, we should update this.
                                writeLine(proxier.natRules, proxier.localDetector.JumpIfNotLocal(args, string(KubeMarkMasqChain))...)
                        }
-                       writeLine(proxier.natRules, append(args, "-j", string(svcChain))...)
                } else {
                        // No endpoints.
                        writeLine(proxier.filterRules,

(It is not even compile-tested, but I hope you get the idea.)

Comment 22 Surya Seetharaman 2020-11-26 11:43:22 UTC
(In reply to Phil Sutter from comment #21)
> Surya,
> 
> I didn't suggest to remove KUBE-SERVICES chain, merely to reorganize rules
> therein. Could you please give this patch a try:
> 
> --- a/pkg/proxy/iptables/proxier.go
> +++ b/pkg/proxy/iptables/proxier.go
> @@ -1054,13 +1054,15 @@ func (proxier *Proxier) syncProxyRules() {
>  
>                 // Capture the clusterIP.
>                 if hasEndpoints {
> -                       args = append(args[:0],
> +                       writeLine(proxier.natRules,
>                                 "-A", string(kubeServicesChain),
>                                 "-m", "comment", "--comment", fmt.Sprintf(`"%s cluster IP"`, svcNameString),
>                                 "-m", protocol, "-p", protocol,
>                                 "-d", utilproxy.ToCIDR(svcInfo.ClusterIP()),
>                                 "--dport", strconv.Itoa(svcInfo.Port()),
> +                               "-j", string(svcChain),
>                         )
> +                       args = append(args[:0], "-A", string(svcChain))
>                         if proxier.masqueradeAll {
>                                 writeLine(proxier.natRules, append(args, "-j", string(KubeMarkMasqChain))...)
>                         } else if proxier.localDetector.IsImplemented() {
> @@ -1071,7 +1073,6 @@ func (proxier *Proxier) syncProxyRules() {
>                                 // If/when we support "Local" policy for VIPs, we should update this.
>                                 writeLine(proxier.natRules, proxier.localDetector.JumpIfNotLocal(args, string(KubeMarkMasqChain))...)
>                         }
> -                       writeLine(proxier.natRules, append(args, "-j", string(svcChain))...)
>                 } else {
>                         // No endpoints.
>                         writeLine(proxier.filterRules,
> 
> (It is not even compile-tested, but I hope you get the idea.)

Ahh yes, of course, now I get it (I am relatively new to the concepts, sorry for not catching this earlier). I completely misunderstood what you were getting at, my bad. Actually I think it's because I myself had an idea to somehow remove one of the KUBE-SERVICES jumps and ended up thinking you were suggesting the same.

Yep, I get it. So you want to move the whole masquerade-marking rule condition into the SVC-FOO chain so that it gets traversed only for that specific svc and not every time globally. Makes sense. I checked against the flow chart and I don't see any harm for clusterIP/nodeports and lbs (for the non-local svc traffic policy) / externalIP (for the localIP/from-off-node) cases.

I'll try this idea out and see if there are any e2e tests failing or if there is any scenario that we haven't considered, and if it's clean I'll submit a PR and see what others say.

Comment 23 Phil Sutter 2020-11-26 14:11:17 UTC
(In reply to Surya Seetharaman from comment #22)
> ahh yes of course now I get it (I am relatively new to the concepts, sorry
> for not catching this earlier). I completely misunderstood what you were
> getting at, my bad. Actually I think its because I myself had an idea to
> somehow remove one of the KUBE-SERVICES jumps and ended up thinking you were
> suggesting the same.

A patch is worth a thousand words! :)

> Yep I get it. So you wanna move the whole masquerade marking rule condition
> to the SVC-FOO chain so that it gets traversed only for that specific svc
> and not every time globally. Makes sense. I checked with the flow chart and
> I don't see any harm for clusterIP/nodeports and lbs-for the non-local svc
> traffic policy/externalIP-for the localIP/from-off-node cases.

Exactly. The goal is to reduce the number of rules each packet hitting
KUBE-SERVICES has to traverse. AFAICT, there are more spots that allow for this
optimization, but the one above is the lowest hanging fruit.

> I'll try this idea out and see if there are any e2e tests failing or if
> there is any scenario that we haven't considered and if its clean I'll
> submit a PR and see what other's say.

OK, fine. Please ping me in case of failures, so we can debug this together.
We're in the same timezone, BTW.

Cheers, Phil

Comment 24 Surya Seetharaman 2020-11-26 15:45:45 UTC
(In reply to Phil Sutter from comment #23)

> 
> > I'll try this idea out and see if there are any e2e tests failing or if
> > there is any scenario that we haven't considered and if its clean I'll
> > submit a PR and see what other's say.
> 
> OK, fine. 

Yes, I am doing this currently; so far so good, I haven't seen anything break. I'd also like to include the nodeports/externalips/loadbalancer masquerading portions in the KUBE-SVC-FOO chain. I will have a patch ASAP with tests. I'll make sure to include your email in the credits of the patch.

> Please ping me in case of failures, so we can debug this together.


Will definitely keep you posted Phil.

> We're in the same timezone, BTW.
> 
> Cheers, Phil

good to know :) 

/Surya.

Comment 25 Surya Seetharaman 2020-11-30 17:47:18 UTC
PR: https://github.com/kubernetes/kubernetes/pull/96959 Let's see what CI/reviewers say.