Bug 1643304

Summary:	firewalld reload causes namespace wide egress IP to stop working
Product:	OpenShift Container Platform	Reporter:	Taneem Ibrahim <tibrahim>
Component:	Networking	Assignee:	Dan Winship <danw>
Status:	CLOSED ERRATA	QA Contact:	Meng Bo <bmeng>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.7.1	CC:	aos-bugs, bmeng, mcurry, tibrahim, weliang, wmeng
Target Milestone:	---	Keywords:	NeedsTestCase
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Egress IP-related iptables rules were not recreated if they got deleted. Consequence: If a user restarted firewalld or iptables.service on a node that hosted egress IPs, then those egress IPs would stop working. (Traffic that should have used the egress IP would use the node's normal IP instead.) Fix: Egress IP iptables rules are now recreated if they are removed. Result: Egress IPs work reliably.	Story Points:	---
Clone Of:
Clones:	1653380 1653381 1653382 1653384 (view as bug list)		Environment:
Last Closed:	2019-06-04 10:40:52 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Taneem Ibrahim 2018-10-25 21:15:09 UTC

Description of problem:

Running firewall-cmd --reload (even when there are no rule changes) causes iptables reload errors and removes the egress IP rules. To recover, we have to run oc patch hostsubnet to remove the egress IP and add it back for the individual namespaces.
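
The workaround mentioned above might look roughly like the following sketch. It only prints the oc patch hostsubnet commands instead of running them. The node name and IP address are placeholders, and clearing the list with an empty egressIPs array is an assumption based on the 3.7 egress IP documentation, not something stated in this report.

```shell
#!/bin/sh
# Sketch only: print the two "oc patch hostsubnet" commands that drop and then
# restore an egress IP on a node. Nothing is executed against the cluster.
# The empty-array form used to clear the list is an assumption.
egress_readd_cmds() {
    node=$1   # HostSubnet (node) name, placeholder
    ip=$2     # egress IP to restore, placeholder
    printf "oc patch hostsubnet %s -p '{\"egressIPs\": []}'\n" "$node"
    printf "oc patch hostsubnet %s -p '{\"egressIPs\": [\"%s\"]}'\n" "$node" "$ip"
}

egress_readd_cmds node1.example.com 192.0.2.10
```

Printing the commands first makes it possible to review them before pasting them into a session that is logged in to the cluster.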

Version-Release number of selected component (if applicable):

v3.7.46

How reproducible:

Always

Steps to Reproduce:
1. Follow the instructions below to enable static egress IP:
https://docs.openshift.com/container-platform/3.7/admin_guide/managing_networking.html#enabling-static-ips-for-external-project-traffic

2. Run: firewall-cmd --reload


Actual results:

The following iptables errors are logged:

Oct 24 18:46:59  firewalld[1071]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -n -L DOCKER' failed: iptables: No chain/target/match by that name.
...
...
Oct 24 18:46:59  firewalld[1071]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -C POSTROUTING -s <redacted>/16 ! -o docker0 -j MASQUERADE' failed: iptables: No chain/target/match by that name.
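
When comparing several reloads, it can help to reduce warnings like these to just the failing commands. The helper below is a small sketch for that. The function name failed_commands is made up for illustration, and the pattern assumes the "COMMAND_FAILED: '...' failed:" layout shown in the log lines above.

```shell
#!/bin/sh
# Print only the command that firewalld reported as failed, one per line.
# Lines that do not match the COMMAND_FAILED layout are dropped.
failed_commands() {
    sed -n "s/.*COMMAND_FAILED: '\(.*\)' failed: .*/\1/p"
}

# Example with the first warning quoted in this report:
failed_commands <<'EOF'
Oct 24 18:46:59  firewalld[1071]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -n -L DOCKER' failed: iptables: No chain/target/match by that name.
EOF
```

On a node, the same filter could presumably be fed from the journal, for example with journalctl -u firewalld piped into failed_commands.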



Expected results:

Egress IP should work when firewalld is enabled.


Additional info:

Comment 4 Dan Winship 2018-11-07 15:22:11 UTC

@Meng Bo: can you try this again? It won't fail completely, but the egress traffic will end up using the node's normal IP rather than the egress IP:

1. Set up a cluster with firewalld running on the nodes
2. Set up an egress IP, test that it works
3. On the node with the egress IP, run "firewall-cmd --reload"
4. Try egress from a pod again, see that it uses the node IP rather than the egress IP
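
One way to see the effect of step 3 without sending traffic from a pod is to check whether a source NAT rule for the egress IP is still present in the nat table. The sketch below assumes that the egress IP shows up as the --to-source argument of a rule in "iptables -S" style output. The function name, chain name, CIDR, mark and IP in the sample are illustrative and not taken from this report.

```shell
#!/bin/sh
# Exit 0 if the rule listing on stdin contains a rule whose "--to-source"
# argument is exactly the given IP, and 1 otherwise.
has_egress_snat() {
    awk -v ip="$1" '
        { for (i = 1; i < NF; i++) if ($i == "--to-source" && $(i + 1) == ip) found = 1 }
        END { exit !found }
    '
}

# Made-up sample in "iptables -S" style; chain name, CIDR and mark are illustrative.
sample='-A OPENSHIFT-MASQUERADE -s 10.128.0.0/14 -m mark --mark 0x1234 -j SNAT --to-source 192.0.2.10'

if printf '%s\n' "$sample" | has_egress_snat 192.0.2.10; then
    echo "egress SNAT rule present"
else
    echo "egress SNAT rule missing"
fi
```

On the node that hosts the egress IP, the listing would come from "iptables -w -t nat -S" (as root), run once before and once after the reload.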

Comment 6 Meng Bo 2018-11-08 07:27:14 UTC

Hmm...

Yes, I can reproduce the problem now.

After firewall-cmd --reload, the pod will use the node's IP as source IP instead of egressIP.

The cause should be the condition that Weibin discovered.

Thanks, Weibin!

Comment 7 Dan Winship 2018-11-08 14:15:56 UTC

(In reply to Weibin Liang from comment #5)
> But the egress IP rule can be restored in iptables by running systemctl
> restart openvswitch/docker/atomic-openshift-node.

Sure, but you're not supposed to have to do that.

Fixed by https://github.com/openshift/origin/pull/21441. I'll do backports after that merges.
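
For readers unfamiliar with the general idea of recreating a rule when it has gone missing, the following is a generic check-then-insert sketch. It is not the code from the pull request, which is not reproduced here. The IPTABLES variable and the ensure_nat_rule name are hypothetical, and the rule in the commented example is made up.

```shell
#!/bin/sh
# Generic check-then-insert idiom, NOT the code from the referenced pull request.
# IPTABLES is a hypothetical override so the helper can be exercised without root.
IPTABLES=${IPTABLES:-iptables}

# ensure_nat_rule CHAIN RULE-SPEC...
# Inserts the rule at the top of CHAIN in the nat table if "-C" reports it absent.
ensure_nat_rule() {
    "$IPTABLES" -w -t nat -C "$@" 2>/dev/null ||
        "$IPTABLES" -w -t nat -I "$@"
}

# Example call (requires root and an existing chain, so it is left commented out):
# ensure_nat_rule POSTROUTING -s 10.128.0.0/14 -m mark --mark 0x1234 -j SNAT --to-source 192.0.2.10
```

A helper like this only covers a single rule in an existing chain. If a reload also removes the chain itself, the insert fails and the chain has to be recreated first.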

Comment 8 Dan Winship 2018-11-19 18:42:40 UTC

So do we need this backported to 3.7 or is the customer happy with their current workaround? (Or planning to upgrade to something newer than 3.7 soon?)

Comment 13 Meng Bo 2018-12-03 09:29:04 UTC

Tested on ocp 3.11.50
The issue has been fixed.

Comment 16 errata-xmlrpc 2019-06-04 10:40:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758