Description of problem:

# Problem was detected by web-console pods failing to start:

[root@starter-us-east-2a-master-a7116 ~]# oc get pods -n openshift-web-console -o=wide
NAME                          READY     STATUS             RESTARTS   AGE       IP             NODE                                          NOMINATED NODE
webconsole-64c9fb4b6d-4lqp6   1/1       Running            24         1d        10.130.1.118   ip-172-31-69-150.us-east-2.compute.internal   <none>
webconsole-64c9fb4b6d-6pp6c   1/1       Running            62         1d        10.128.0.184   ip-172-31-69-80.us-east-2.compute.internal    <none>
webconsole-64c9fb4b6d-f2vsv   0/1       CrashLoopBackOff   391        1d        10.129.0.209   ip-172-31-75-97.us-east-2.compute.internal    <none>

# Pods are failing because they cannot find a route to the kube master:

[root@starter-us-east-2a-master-a7116 ~]# oc logs webconsole-64c9fb4b6d-f2vsv -n openshift-web-console
Error: Get https://172.30.0.1:443/.well-known/oauth-authorization-server: dial tcp 172.30.0.1:443: getsockopt: no route to host

# The pod was running on a master, and the kube route was not present in iptables:

[root@starter-us-east-2a-master-cecf1 ~]# iptables-save | grep 172.30.0.1/

# I noted that iptables was failing to start because of a seemingly invalid PSAD rule in /etc/sysconfig/iptables:

[root@starter-us-east-2a-master-cecf1 ~]# systemctl status iptables
● iptables.service - IPv4 firewall with iptables
   Loaded: loaded (/usr/lib/systemd/system/iptables.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2019-03-15 20:36:53 UTC; 1 day 21h ago
  Process: 9339 ExecStart=/usr/libexec/iptables/iptables.init start (code=exited, status=1/FAILURE)
 Main PID: 9339 (code=exited, status=1/FAILURE)

Mar 15 20:36:52 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: Starting IPv4 firewall with iptables...
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal iptables.init[9339]: iptables: Applying firewall rules: iptables-restore: line 350 failed
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal iptables.init[9339]: [FAILED]
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: iptables.service: main process exited, code=exited, status=1/FAILURE
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: Failed to start IPv4 firewall with iptables.
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: Unit iptables.service entered failed state.
Mar 15 20:36:53 ip-172-31-75-97.us-east-2.compute.internal systemd[1]: iptables.service failed.

# The rule that seems to be impossible to load (I believe ops security added this):

-A OUTPUT -m comment --comment id_output_psad_logging_1_ -o eth0 -p tcp -m tcp --dport 22 -m state --state NEW -m hashlimit --hashlimit-above 40/sec --hashlimit-burst 60 --hashlimit-mode srcip --hashlimit-name psad3 -j LOG --log-prefix "PSAD:"

It is not clear what factors contribute to the kube route not being created by the SDN in iptables.

Version-Release number of selected component (if applicable):
v3.11.82

How reproducible:
100% on affected masters (restarts had no effect)

Actual results:
Pods could not communicate with kube because iptables lacked the required entry.

Expected results:
Whatever was causing the failure would ideally be detected and reported in the sdn or ovs logs.
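For anyone triaging a similar failure, the journal already names the offending line number (`iptables-restore: line 350 failed`), so the failing rule can be pulled straight out of the saved rules file. The sketch below is hypothetical and self-contained: it builds a tiny sample rules file and extracts a line by number, standing in for running `sed -n '350p' /etc/sysconfig/iptables` on an affected master.

```shell
# Build a miniature stand-in for /etc/sysconfig/iptables. On a real host you
# would skip this and point sed at the actual file.
RULES=$(mktemp)
cat > "$RULES" <<'EOF'
*filter
:OUTPUT ACCEPT [0:0]
-A OUTPUT -m comment --comment id_output_psad_logging_1_ -o eth0 -p tcp -m tcp --dport 22 -m state --state NEW -m hashlimit --hashlimit-above 40/sec --hashlimit-burst 60 --hashlimit-mode srcip --hashlimit-name psad3 -j LOG --log-prefix "PSAD:"
COMMIT
EOF

# Extract the line the journal complained about (line 3 here; line 350 in the
# real report) so the rule can be inspected in isolation.
failing_rule=$(sed -n '3p' "$RULES")
echo "$failing_rule"
rm -f "$RULES"
```

Once the rule is identified, `iptables-restore --test < /etc/sysconfig/iptables` should reproduce the parse/load failure without actually applying any rules, which is safer than cycling the iptables service on a production master.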
Additional info:
http://file.rdu.redhat.com/~jupierce/share/no-route-to-kube.tgz

Elements of the system I collected before trying to solve the problem:
- iptables-save output before working on fixing the cluster
- journal entries from the iptables service
- a copy of /etc/sysconfig/iptables

Steps taken to get this cluster working again:
mv /etc/sysconfig/iptables /etc/sysconfig/iptables.containsbak
systemctl stop atomic-openshift-node
systemctl stop docker
systemctl stop iptables
systemctl disable iptables
systemctl mask iptables
iptables -F
iptables -t nat --flush
systemctl start docker
systemctl start atomic-openshift-node
delete the pods for ovs & sdn on the affected master
*wait a few minutes*
Eventually, `iptables-save | grep 172.30.0.1/` begins returning results.
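The "wait a few minutes" step at the end can be scripted rather than eyeballed. This is a hypothetical helper (the function name and timeout are mine, not part of the recovery procedure as performed): it polls a command until it produces output or a deadline passes, so on a real master it could be called as `wait_for_output 300 "iptables-save | grep 172.30.0.1/"`.

```shell
# Poll a command until it emits output or the timeout (in seconds) expires.
# Returns 0 and prints the output on success, returns 1 on timeout.
wait_for_output() {
  timeout=$1
  cmd=$2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    out=$(eval "$cmd" 2>/dev/null)
    if [ -n "$out" ]; then
      echo "$out"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Harmless demo with a command that succeeds immediately; on an affected
# master the iptables-save pipeline above would be used instead.
result=$(wait_for_output 5 "echo 172.30.0.1")
echo "$result"
```

Polling like this also makes it obvious when the SDN never converges (the helper returns non-zero), which is the case where a reboot was ultimately required.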
- In at least one situation, the listed procedure was not sufficient. I had to reboot the master before the 172.30.0.1 iptables entry was established. - In clearing cluster 2a, I also had to delete the router pods to get the web console loading again.
Just for some background, it would be very unusual for issues with the PSAD iptables rule to be popping up now. It has been in place across all starter and OSIO clusters for upwards of 6 months, across multiple reboots, with no reported issues. My guess is that something else changed within iptables recently, which either:
1. Clashes with the existing PSAD rule
2. Changes something that the PSAD rule expected to stay the same, suddenly making it invalid
PSAD configuration management has been disabled, and its projects have been removed from OSIO and starter clusters to aid the troubleshooting efforts. I'll try to reproduce the issues in int or stg in the meantime.
*** Bug 1651784 has been marked as a duplicate of this bug. ***
*** Bug 1665763 has been marked as a duplicate of this bug. ***
*** Bug 1668414 has been marked as a duplicate of this bug. ***