Bug 2115039 - [4.8] On updating cluster from 4.8.34=>4.8.43, cu has noticed stale iptables rules that cause SVC of type LB to fail after redeployment of pods
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Linux
Target Milestone: ---
: 4.8.z
Assignee: Surya Seetharaman
QA Contact: Arti Sood
Depends On: 2115845
Reported: 2022-08-03 18:32 UTC by milti leonard
Modified: 2022-09-14 20:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2115845 (view as bug list)
Last Closed: 2022-09-14 20:38:59 UTC
Target Upstream Version:
surya: needinfo-

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovn-kubernetes pull 1234 | 0 | None | open | Bug 2115039: [release-4.8] OCP CARRY: Add the rules to EXTERNALIPs only for SGW mode | 2022-08-05 16:47:59 UTC
Red Hat Product Errata | RHSA-2022:6308 | 0 | None | None | None | 2022-09-14 20:39:15 UTC

Description milti leonard 2022-08-03 18:32:55 UTC
Description of problem:
Our tenant is creating a Service of type LoadBalancer.
We use OVN in local gateway mode,
so we expect iptables rules to be present in the kernel iptables.
We do see the iptables rules, but a stale iptables rule is preventing the SVC from working.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
Please see the comment below.

Actual results:
Upon redeployment, iptables retains the stale IP for the LB SVC, which causes connections to fail. Rebooting the node corrects the iptables rules. This was NOT happening in 4.8.34 and only started after the cluster was upgraded to 4.8.43.

Expected results:
Upon redeployment, the iptables rules are refreshed to include the new IP for the SVC.

Additional info:

Comment 2 milti leonard 2022-08-03 19:21:57 UTC
Hello, I've included the steps to reproduce the issue in the comment above. As stated earlier, this issue didn't appear until after the upgrade to 4.8.43 (from 4.8.34). My colleague @jocolema has found the likely PR causing this; please see his SFDC comment below:

I wanted to let you know where I am at here.  I have been searching through some release notes to see if I could find a change that is similar to what we see between 4.8.34 and 4.8.43.  I have found the following change in 4.8.36 which _might_ be relevant:

 Going back through the release notes, I see the following in 4.8.36:

 The errata this posted has a bug fix here:

 I see that this points us to the PR:

This change is essentially focused around components like metalLB, and problem statement:

 'Services of type loadbalancer do not work if the traffic reaches the node from an interface different from br-ex'

So I am still trying to figure out whether this is the change you are seeing, with MetalLB being just the primary motivation while the fix applies more globally to iptables. If so, I wonder whether the focus was on shared gateway mode, and hence we might be hitting a bug while using local gateway mode.

I'll include further information on how OVN is used in the cu cluster from the ticket in another comment. The cu is a telco and is using the Ericsson blueprint dual-mode for 5GC deployments [1] (or E///). They are prohibited from upgrading to 4.10 until E/// is upgraded as well; it's likely that the MetalLB use here follows from that.

There are gathers/sosreports/iptables files attached to the SFDC case, but they were taken before the steps outlined above. I've requested fresh data to hopefully show what was happening during the reproducer. Once those are attached, this BZ will be updated to note it.

[1] https://gitlab.consulting.redhat.com/djuran/ericsson-blueprint/-/tree/release_5GC_TS-1.3#user-content-sr-iov

Comment 5 milti leonard 2022-08-03 19:26:29 UTC
I've submitted the significant comments from the cu case ticket above. Please let me know whether anything further is needed to work this case. Currently there is a workaround, but the cu is asking for a fix, as it's likely that the PRs highlighted in the previous comment broke something. I've marked the severity as high, but note that the cu is a high-priority one.

Comment 7 milti leonard 2022-08-03 21:26:24 UTC
The cu has attached a fresh cluster/network gather to the ticket [1], plus sosreports from the nodes where the pods in the test were scheduled [2]. It appears that the iptables information is already included in the network gather in [1]. Please let me know whether anything further is required to begin this investigation.

[1] https://attachments.access.redhat.com/hydra/rest/cases/03277795/attachments/62598b2f-3a70-4c42-bc28-597c202bcb9f
[2] https://attachments.access.redhat.com/hydra/rest/cases/03277795/attachments/a0bc5a43-86e5-419c-a590-3475cb8ae87d

Comment 10 Surya Seetharaman 2022-08-04 19:57:42 UTC
// Debugging Notes
After investigating a bit more, realised what is happening. Fixing https://github.com/openshift/ovn-kubernetes/pull/967 and https://github.com/openshift/ovn-kubernetes/pull/905 indirectly contributed to this bug.
It's a slightly nasty bug that will only affect users:
1) on LGW mode,
2) using LB svcs where svc.Status.loadBalancer is set
3) on 4.8 clusters if >= OCP 4.8.36
4) on 4.9 clusters if >= OCP 4.9.23
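
The four conditions above can be written down as a small predicate. This is only an illustrative sketch: the `clusterInfo` type and the `exposed` function are hypothetical names, not part of ovn-kubernetes, and the check ignores the upper bound (the release that carries the fix).

```go
package main

import "fmt"

// clusterInfo is a hypothetical summary of the facts the list above depends on.
type clusterInfo struct {
	gatewayMode  string // "local" (LGW) or "shared" (SGW)
	minor, patch int    // the y and z of an OCP 4.y.z version
	hasLBIngress bool   // true if any LB service has status.loadBalancer set
}

// exposed reports whether a cluster meets the four conditions listed above.
// It deliberately ignores the upper bound (the release that carries the fix).
func exposed(c clusterInfo) bool {
	if c.gatewayMode != "local" || !c.hasLBIngress {
		return false
	}
	switch c.minor {
	case 8:
		return c.patch >= 36
	case 9:
		return c.patch >= 23
	default:
		return false
	}
}

func main() {
	cu := clusterInfo{gatewayMode: "local", minor: 8, patch: 43, hasLBIngress: true}
	fmt.Println("4.8.43 LGW exposed:", exposed(cu))
	cu.patch = 34
	fmt.Println("4.8.34 LGW exposed:", exposed(cu))
}
```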

** Temporary Workaround of course is to restart ovnkube-node until we run into this issue again.

OCP 4.8.43:

ovnkube-node startup:

switch config.Gateway.Mode {
	case config.GatewayModeLocal:
		klog.Info("Preparing Local Gateway")
		gw, err = newLocalGateway(n.name, subnets, gatewayNextHops, gatewayIntf, nodeAnnotator, n.recorder, managementPortConfig)

followed by:

initGw := func() error {
		return gw.Init(n.watchFactory)

followed by:

func (g *gateway) Init(wf factory.NodeWatchFactory) error {
	err := g.initFunc()
	if err != nil {
		return err
		AddFunc: func(obj interface{}) {
			svc := obj.(*kapi.Service)
		UpdateFunc: func(old, new interface{}) {
			oldSvc := old.(*kapi.Service)
			newSvc := new.(*kapi.Service)
			g.UpdateService(oldSvc, newSvc)
		DeleteFunc: func(obj interface{}) {
			svc := obj.(*kapi.Service)
	}, g.SyncServices)

followed by:

func (g *gateway) SyncServices(objs []interface{}) {
	if g.localPortWatcher != nil {
func (g *gateway) AddService(svc *kapi.Service) {
	if g.localPortWatcher != nil {

followed by:

PATH-A leads to:
func (l *localPortWatcher) SyncServices(serviceInterface []interface{}) {
	keepIPTRules := []iptRule{}
	for _, service := range serviceInterface {
		svc, ok := service.(*kapi.Service)
		if !ok {
			klog.Errorf("Spurious object in syncServices: %v", serviceInterface)
		keepIPTRules = append(keepIPTRules, getGatewayIPTRules(svc, []string{l.gatewayIPv4, l.gatewayIPv6})...) ---> PROBLEM IS HERE!! 
	for _, chain := range []string{iptableNodePortChain, iptableExternalIPChain} {
		recreateIPTRules("nat", chain, keepIPTRules)

	// Previously LGW used routes in the localnetGatewayExternalIDTable, to handle
	// upgrades correctly make sure we flush this table of all routes
	klog.Infof("Flushing host's routing table: %s", localnetGatewayExternalIDTable)
	if _, stderr, err := util.RunIP("route", "flush", "table", localnetGatewayExternalIDTable); err != nil {
		klog.Errorf("Error flushing host's routing table: %s stderr: %s err: %v", localnetGatewayExternalIDTable, stderr, err)

followed by:

func getGatewayIPTRules(service *kapi.Service, gatewayIPs []string) []iptRule {
		externalIPs := make([]string, 0, len(service.Spec.ExternalIPs)+len(service.Status.LoadBalancer.Ingress))
		externalIPs = append(externalIPs, service.Spec.ExternalIPs...)
		for _, ingress := range service.Status.LoadBalancer.Ingress { -----> NOW WE ADD THESE LB RULES TO EXTERNALIP CHAIN FOR SGW
			if len(ingress.IP) > 0 {
				externalIPs = append(externalIPs, ingress.IP)

		for _, externalIP := range externalIPs {
			err := util.ValidatePort(svcPort.Protocol, svcPort.Port)
			if err != nil {
				klog.Errorf("Skipping service: %s, invalid service port %v", svcPort.Name, err)
			if clusterIP, err := util.MatchIPStringFamily(utilnet.IsIPv6String(externalIP), clusterIPs); err == nil {
				rules = append(rules, getExternalIPTRules(svcPort, externalIP, clusterIP)...)
	return rules

PATH-B leads to:
func (l *localPortWatcher) addService(svc *kapi.Service) error {
	// don't process headless service or services that do not have NodePorts or ExternalIPs
	if !util.ServiceTypeHasClusterIP(svc) || !util.IsClusterIPSet(svc) {
		return nil

	for _, ip := range util.GetClusterIPs(svc) {
		iptRules := []iptRule{}
		isIPv6Service := utilnet.IsIPv6String(ip)
		gatewayIP := l.gatewayIPv4
		if isIPv6Service {
			gatewayIP = l.gatewayIPv6
		for _, port := range svc.Spec.Ports {
			// Fix Azure/GCP LoadBalancers. They will forward traffic directly to the node with the
			// dest address as the load-balancer ingress IP and port
			iptRules = append(iptRules, getLoadBalancerIPTRules(svc, port, ip, port.Port)...) ----> NOW WE ADD THESE LB RULES TO NODEPORT CHAIN FOR LGW

So really the problem is with calling getGatewayIPTRules from LGW and assuming all the rules are calculated the same way for both modes. When we fixed the SGW bug, we deviated in how the rules are added, thus causing this bug.

Fix should be a one-liner of checking gateway-mode before adding LB.Status.Ingress rules to EXTERNALIPS chain.
NOTE: This fix will only go into 4.9 and 4.8, as doing this one-liner will break >= 4.10 where we do things the same way for both the modes.
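
A rough sketch of what such a gateway-mode check could look like follows. It is not the actual patch: `gatewayMode`, `svc` and `externalIPChainIPs` are hypothetical stand-ins for the real config and Service types, and the addresses are documentation-range examples.

```go
package main

import "fmt"

// gatewayMode is a hypothetical stand-in for config.Gateway.Mode.
type gatewayMode string

const (
	modeLocal  gatewayMode = "local"
	modeShared gatewayMode = "shared"
)

// svc is a hypothetical minimal view of a Service: spec.externalIPs
// plus the IPs found in status.loadBalancer.ingress.
type svc struct {
	externalIPs []string
	lbIngress   []string
}

// externalIPChainIPs returns the IPs that should get a rule in the
// EXTERNALIP chain. LB ingress IPs are included only in shared
// gateway mode; in local gateway mode they stay in the NODEPORT chain.
func externalIPChainIPs(s svc, mode gatewayMode) []string {
	ips := append([]string{}, s.externalIPs...)
	if mode == modeShared {
		for _, ip := range s.lbIngress {
			if len(ip) > 0 {
				ips = append(ips, ip)
			}
		}
	}
	return ips
}

func main() {
	s := svc{externalIPs: []string{"198.51.100.7"}, lbIngress: []string{"192.0.2.10"}}
	fmt.Println("LGW:", externalIPChainIPs(s, modeLocal))
	fmt.Println("SGW:", externalIPChainIPs(s, modeShared))
}
```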

Comment 12 Surya Seetharaman 2022-08-04 21:29:13 UTC
Able to reproduce this on GCP as well as vSphere+MetalLB:


Switch to LGW mode:
Create LB svc:

sh-4.4# iptables-save | grep 104.
-A OVN-KUBE-NODEPORT -d -p tcp -m tcp --dport 8080 -j DNAT --to-destination

restart ovnkube-node: 

sh-4.4# iptables-save | grep 104.
-A OVN-KUBE-NODEPORT -d -p tcp -m tcp --dport 8080 -j DNAT --to-destination
-A OVN-KUBE-EXTERNALIP -d -p tcp -m tcp --dport 8080 -j DNAT --to-destination

We end up with 2 rules. Tested the quick fix:
sh-4.4# iptables-save | grep 104.
-A gcp-vips -d -j REDIRECT
-A OVN-KUBE-NODEPORT -d -p tcp -m tcp --dport 8080 -j DNAT --to-destination
we ended up with only 1 again.
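
The same check can be automated against a saved `iptables-save` dump. The sketch below is an assumption-laden helper, not an existing tool: `parseRule` and `duplicatedInBothChains` are hypothetical names, the sample addresses are invented (the output above has them stripped), and it only looks at `-A` lines that carry both `-d` and `--dport`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseRule pulls the chain, destination and destination port out of one
// iptables-save line. ok is false unless all three are present.
func parseRule(line string) (chain, dst, dport string, ok bool) {
	f := strings.Fields(line)
	for i := 0; i+1 < len(f); i++ {
		switch f[i] {
		case "-A":
			chain = f[i+1]
		case "-d":
			dst = f[i+1]
		case "--dport":
			dport = f[i+1]
		}
	}
	return chain, dst, dport, chain != "" && dst != "" && dport != ""
}

// duplicatedInBothChains returns every "dst dport N" key that has a rule
// in both OVN-KUBE-NODEPORT and OVN-KUBE-EXTERNALIP, in first-seen order.
func duplicatedInBothChains(dump string) []string {
	chains := map[string]map[string]bool{}
	var order []string
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		chain, dst, dport, ok := parseRule(sc.Text())
		if !ok {
			continue
		}
		key := dst + " dport " + dport
		if chains[key] == nil {
			chains[key] = map[string]bool{}
			order = append(order, key)
		}
		chains[key][chain] = true
	}
	var dups []string
	for _, key := range order {
		if chains[key]["OVN-KUBE-NODEPORT"] && chains[key]["OVN-KUBE-EXTERNALIP"] {
			dups = append(dups, key)
		}
	}
	return dups
}

func main() {
	// Sample dump with made-up addresses.
	dump := `-A OVN-KUBE-NODEPORT -d 192.0.2.10/32 -p tcp -m tcp --dport 8080 -j DNAT --to-destination 10.0.0.2:8080
-A OVN-KUBE-EXTERNALIP -d 192.0.2.10/32 -p tcp -m tcp --dport 8080 -j DNAT --to-destination 10.0.0.1:8080`
	for _, d := range duplicatedInBothChains(dump) {
		fmt.Println("rule present in both chains:", d)
	}
}
```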

With metalLB+vsphere: {Bigger problem if we reuse the same SVC VIP for LB; like we do below}

[surya@hidden-temple ovn-kubernetes]$ oc get svc -n a1                                                                                                                       
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE                                                                                               
hello-world   LoadBalancer   80:32509/TCP   174m                                                                                              

sh-4.4# iptables-save -c | grep
[0:0] -A OVN-KUBE-NODEPORT -d -p tcp -m tcp --dport 80 -j DNAT --to-destination
[0:0] -A OVN-KUBE-EXTERNALIP -d -p tcp -m tcp --dport 80 -j DNAT --to-destination

recreating the svc with different clusterIP but same LB VIP will leave us with stale entries in EXTERNALIPs chain:

[surya@hidden-temple ovn-kubernetes]$ oc get svc -n a1                                                                                                                       
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE                                                                                                
hello-world   LoadBalancer   <pending>     80:31835/TCP   18s 

sh-4.4# iptables-save -c | grep
[0:0] -A OVN-KUBE-NODEPORT -d -p tcp -m tcp --dport 80 -j DNAT --to-destination
[0:0] -A OVN-KUBE-EXTERNALIP -d -p tcp -m tcp --dport 80 -j DNAT --to-destination

Testing my fix on metalLB+vsphere:

[surya@hidden-temple ovn-kubernetes]$ oc get svc -n a1                                                                                                                       
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE                                                                                               
hello-world   LoadBalancer   80:31835/TCP   51s 
sh-4.4# iptables-save -c | grep
[0:0] -A OVN-KUBE-NODEPORT -d -p tcp -m tcp --dport 80 -j DNAT --to-destination

even after recreating the svc and/or restarting the pod it stays the same and gets updated correctly, no false rules in EXTERNALIPs chain:
[surya@hidden-temple ovn-kubernetes]$ oc get svc -n a1
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE
hello-world   LoadBalancer   80:30606/TCP   13s
sh-4.4# iptables-save -c | grep
[0:0] -A OVN-KUBE-NODEPORT -d -p tcp -m tcp --dport 80 -j DNAT --to-destination

Thanks Arti for helping with setup/testing!

Comment 21 errata-xmlrpc 2022-09-14 20:38:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.8.49 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

