Bug 1689690 - starter-us-east-2 & 2a experienced outage because kube ip not present in master iptables: tcp 172.30.0.1:443: getsockopt: no route to host
Summary: starter-us-east-2 & 2a experienced outage because kube ip not present in master iptables: tcp 172.30.0.1:443: getsockopt: no route to host
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Dan Williams
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1651784 1665763 1668414
Depends On:
Blocks:
 
Reported: 2019-03-17 18:57 UTC by Justin Pierce
Modified: 2022-03-13 17:03 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-20 14:42:47 UTC
Target Upstream Version:
Embargoed:



Description Justin Pierce 2019-03-17 18:57:02 UTC
Description of problem:

# Problem was detected by web-console pods failing to start:
[root@starter-us-east-2a-master-a7116 ~]# oc get pods -n openshift-web-console -o=wide
NAME                          READY     STATUS             RESTARTS   AGE       IP             NODE                                          NOMINATED NODE
webconsole-64c9fb4b6d-4lqp6   1/1       Running            24         1d        10.130.1.118   ip-172-31-69-150.us-east-2.compute.internal   <none>
webconsole-64c9fb4b6d-6pp6c   1/1       Running            62         1d        10.128.0.184   ip-172-31-69-80.us-east-2.compute.internal    <none>
webconsole-64c9fb4b6d-f2vsv   0/1       CrashLoopBackOff   391        1d        10.129.0.209   ip-172-31-75-97.us-east-2.compute.internal    <none>


# Pods are failing because they cannot find a route to the kube master
[root@starter-us-east-2a-master-a7116 ~]# oc logs webconsole-64c9fb4b6d-f2vsv -n openshift-web-console
Error: Get https://172.30.0.1:443/.well-known/oauth-authorization-server: dial tcp 172.30.0.1:443: getsockopt: no route to host
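# (For reference, the failing request can be reproduced directly from the node; on a healthy
# master the following returns the OAuth metadata JSON. Illustrative command, not captured
# from the affected cluster.)
curl -sk https://172.30.0.1:443/.well-known/oauth-authorization-server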


# The pod was running on a master and the kube route was not present in iptables
[root@starter-us-east-2a-master-cecf1 ~]# iptables-save | grep 172.30.0.1/
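# (On a healthy node the same grep returns the kube-proxy service entries for the kubernetes
# cluster IP, roughly like the line below. Illustrative output only; exact chain names depend
# on the cluster's services.)
-A KUBE-SERVICES -d 172.30.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y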

# I noted that iptables was failing to start because of a seemingly invalid PSAD rule in /etc/sysconfig/iptables:
[root@starter-us-east-2a-master-cecf1 ~]# systemctl status iptables
● iptables.service - IPv4 firewall with iptables
   Loaded: loaded (/usr/lib/systemd/system/iptables.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2019-03-15 20:36:53 UTC; 1 day 21h ago
  Process: 9339 ExecStart=/usr/libexec/iptables/iptables.init start (code=exited, status=1/FAILURE)
 Main PID: 9339 (code=exited, status=1/FAILURE)

Mar 15 20:36:52 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: Starting IPv4 firewall with iptables...
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal iptables.init[9339]: iptables: Applying firewall rules: iptables-restore: line 350 failed
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal iptables.init[9339]: [FAILED]
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: iptables.service: main process exited, code=exited, status=1/FAILURE
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: Failed to start IPv4 firewall with iptables.
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: Unit iptables.service entered failed state.
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: iptables.service failed.

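# (The failing rule can be read straight from the saved ruleset using the line number reported
# by iptables-restore above; illustrative follow-up, not part of the original report.)
sed -n '350p' /etc/sysconfig/iptables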
# The rule that seems to be impossible to load (I believe ops security added this):
-A OUTPUT -m comment --comment id_output_psad_logging_1_ -o eth0 -p tcp -m tcp --dport 22 -m state --state NEW -m hashlimit --hashlimit-above 40/sec --hashlimit-burst 60 --hashlimit-mode srcip --hashlimit-name psad3 -j LOG --log-prefix "PSAD:"
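# (One way to check whether this rule alone is loadable is to try applying it by hand and watch
# for a match-extension/module error. Hypothetical isolation test, not something run on the
# affected cluster; clean up with -D if the -A succeeds.)
iptables -A OUTPUT -m comment --comment id_output_psad_logging_1_ -o eth0 -p tcp -m tcp --dport 22 -m state --state NEW -m hashlimit --hashlimit-above 40/sec --hashlimit-burst 60 --hashlimit-mode srcip --hashlimit-name psad3 -j LOG --log-prefix "PSAD:"
iptables -D OUTPUT -m comment --comment id_output_psad_logging_1_ -o eth0 -p tcp -m tcp --dport 22 -m state --state NEW -m hashlimit --hashlimit-above 40/sec --hashlimit-burst 60 --hashlimit-mode srcip --hashlimit-name psad3 -j LOG --log-prefix "PSAD:"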


It is not clear what factors contribute to the kube route not being created by the SDN in iptables.


Version-Release number of selected component (if applicable):
v3.11.82


How reproducible:
100% on affected masters (restarts had no effect)

Actual results:
Pods could not communicate with the kube service because iptables lacked an entry for 172.30.0.1.

Expected results:
Whatever was causing the failure would ideally be detected and reported in sdn or ovs logs. 


Additional info:

http://file.rdu.redhat.com/~jupierce/share/no-route-to-kube.tgz contains elements of the system state I collected before trying to solve the problem:
- iptables-save output before working on fixing the cluster
- journal entries from the iptables service
- a copy of /etc/sysconfig/iptables

Steps taken to get this cluster working again:
mv /etc/sysconfig/iptables /etc/sysconfig/iptables.containsbak
systemctl stop atomic-openshift-node
systemctl stop docker
systemctl stop iptables
systemctl disable iptables
systemctl mask iptables
iptables -F
iptables -tnat --flush
systemctl start docker
systemctl start atomic-openshift-node
delete pods for ovs & sdn for the affected master (see the example after these steps)
*wait a few minutes*
eventually, `iptables-save | grep 172.30.0.1/` begins returning results
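# (The pod deletion and the final check can be done along these lines. The openshift-sdn
# namespace and pod names are assumed from a default 3.11 SDN daemonset install; adjust to
# match the cluster, and substitute the actual pod names for the placeholders.)
oc -n openshift-sdn get pods -o wide | grep ip-172-31-75-97.us-east-2.compute.internal
oc -n openshift-sdn delete pod <sdn-pod-name> <ovs-pod-name>
# watch for the kube service rules to reappear
watch -n 10 'iptables-save | grep 172.30.0.1/'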

Comment 1 Justin Pierce 2019-03-17 19:34:05 UTC
- In at least one situation, the listed procedure was not sufficient. I had to reboot the master before the 172.30.0.1 iptables entry was established. 
- In clearing cluster 2a, I also had to delete the router pods to get the web console loading again.

Comment 4 Douglas Edgar 2019-03-18 16:34:39 UTC
Just for some background, it would be very unusual for issues with the PSAD iptables rule to be popping up now. It's been in place across all starter and OSIO clusters for upwards of 6 months across multiple reboots with no reported issues.

My guess is that something else changed within iptables recently, which either:
1. Clashes with the existing PSAD rule
2. Changes something that the PSAD rule expected to stay the same, suddenly making it invalid

Comment 5 Douglas Edgar 2019-03-18 22:32:28 UTC
PSAD configuration management has been disabled, and its projects have been removed from OSIO and starter clusters to aid the troubleshooting effort. I'll try to reproduce the issues in int or stg in the meantime.

Comment 17 Ben Bennett 2019-03-26 16:07:29 UTC
*** Bug 1651784 has been marked as a duplicate of this bug. ***

Comment 18 Ben Bennett 2019-03-26 16:08:01 UTC
*** Bug 1665763 has been marked as a duplicate of this bug. ***

Comment 19 Dan Williams 2019-03-28 21:04:33 UTC
*** Bug 1668414 has been marked as a duplicate of this bug. ***

