Description of problem: Pods with hostports somehow lose these hostports in iptables.

Version-Release number of selected component (if applicable): 3.7.72-1

How reproducible: Unconfirmed

Actual results: See attachments

Additional info: Similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1629419 in 3.6, but it was not resolved by updating to 3.7.

Note that we have the following time range for the loss of the rules: 11 Jun 12:00 GMT to 12 Jun 02:15 GMT. We see the atomic-openshift-node service restarted in that window on all 3 nodes (cause currently unknown):

```
[TEST1][user@ose3-int-a-minion-i11 ~]$ systemctl status atomic-openshift-node
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Tue 2019-06-11 20:04:39 UTC; 6 days ago
     Docs: https://github.com/openshift/origin
 Main PID: 17382 (openshift)
   Memory: 177.1M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─17382 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─17614 journalctl -k -f

[TEST1][user@ose3-int-a-minion-i12 ~]$ systemctl status atomic-openshift-node
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Tue 2019-06-11 20:04:40 UTC; 6 days ago
     Docs: https://github.com/openshift/origin
 Main PID: 13695 (openshift)
   Memory: 572.4M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─13695 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─13943 journalctl -k -f

[TEST1][user@ose3-int-a-minion-i13 ~]$ systemctl status atomic-openshift-node
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Tue 2019-06-11 20:04:39 UTC; 6 days ago
     Docs: https://github.com/openshift/origin
 Main PID: 5004 (openshift)
   Memory: 186.3M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─5004 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─5280 journalctl -k -f
```

It seems probable this is when the hostport rules were lost, i.e. they did not get reapplied on restart of the node.
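To check whether the expected hostport rules survived a restart without reading the full dump, a small check like the following can help. This is a hypothetical helper, not part of the product; the sample dump below stands in for real `iptables-save` output and imitates the post-restart state where only the 8888 rule remains.

```shell
# Hypothetical helper: report which expected hostPorts still have a
# KUBE-HOSTPORTS entry in an iptables-save dump. On a real node you
# would capture the dump with: iptables-save > /tmp/ipt-dump.txt
cat > /tmp/ipt-dump.txt <<'EOF'
-A KUBE-HOSTPORTS -p tcp -m comment --comment "ruby-ex hostport 8888" -m tcp --dport 8888 -j KUBE-HP-EXAMPLE
EOF

for port in 8888 9999; do
    if grep -q -- "-A KUBE-HOSTPORTS .*--dport $port " /tmp/ipt-dump.txt; then
        echo "hostport $port: present"
    else
        echo "hostport $port: MISSING"
    fi
done
# prints:
# hostport 8888: present
# hostport 9999: MISSING
```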
I think I reproduced. Two applications:

- ruby-ex, listening on port 8888
- killme, listening on port 9999

The goal was, at first, to kill the "killme" pods to see whether losing their hostport would inadvertently affect the ruby-ex port. I found this did not happen. However, restarting the atomic-openshift-node service DOES result in the ruby-ex pod losing its hostport. Here's my data. Terminal 1 shows my work. In terminal 2, I have a loop checking iptables-save for rules around ports 8888 and 9999:

```
# while true; do iptables-save | grep -e 8888 -e 9999 ; date; sleep 10; done
```

=========================
// Scale down

```
[quicklab@master-0 ~]$ oc get pod
NAME              READY     STATUS    RESTARTS   AGE
killme-4-grgpx    1/1       Running   0          21s
ruby-ex-4-nhlf8   1/1       Running   0          6m
[quicklab@master-0 ~]$ oc scale dc killme --replicas=0
deploymentconfig "killme" scaled
[quicklab@master-0 ~]$ date
Tue Jun 25 17:18:50 EDT 2019
```

We see port 9999 disappear after the scaledown, as expected. Port 8888 remains.

```
-A KUBE-HOSTPORTS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp --dport 8888 -j KUBE-HP-QI2R4GYTHZW5JXWS
-A KUBE-HOSTPORTS -p tcp -m comment --comment "killme-4-grgpx_ruby hostport 9999" -m tcp --dport 9999 -j KUBE-HP-3LMI74HSZVOC4LTK
-A KUBE-HP-3LMI74HSZVOC4LTK -s 10.129.0.11/32 -m comment --comment "killme-4-grgpx_ruby hostport 9999" -j KUBE-MARK-MASQ
-A KUBE-HP-3LMI74HSZVOC4LTK -p tcp -m comment --comment "killme-4-grgpx_ruby hostport 9999" -m tcp -j DNAT --to-destination 10.129.0.11:8080
-A KUBE-HP-QI2R4GYTHZW5JXWS -s 10.129.0.9/32 -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -j KUBE-MARK-MASQ
-A KUBE-HP-QI2R4GYTHZW5JXWS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp -j DNAT --to-destination 10.129.0.9:8080
Tue Jun 25 17:18:42 EDT 2019
-A KUBE-HOSTPORTS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp --dport 8888 -j KUBE-HP-QI2R4GYTHZW5JXWS
-A KUBE-HP-QI2R4GYTHZW5JXWS -s 10.129.0.9/32 -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -j KUBE-MARK-MASQ
-A KUBE-HP-QI2R4GYTHZW5JXWS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp -j DNAT --to-destination 10.129.0.9:8080
Tue Jun 25 17:18:52 EDT 2019
```

=========================
Now, we'll restart the atomic-openshift-node service and check.

```
[quicklab@node-0 ~]$ sudo systemctl restart atomic-openshift-node
[quicklab@node-0 ~]$ date
Tue Jun 25 17:21:48 EDT 2019
[quicklab@node-0 ~]$ sudo systemctl status atomic-openshift-node
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Tue 2019-06-25 17:21:35 EDT; 29s ago
```

Terminal 2 shows the 8888 rules present just before the restart, then nothing afterwards:

```
-A KUBE-HOSTPORTS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp --dport 8888 -j KUBE-HP-QI2R4GYTHZW5JXWS
-A KUBE-HP-QI2R4GYTHZW5JXWS -s 10.129.0.9/32 -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -j KUBE-MARK-MASQ
-A KUBE-HP-QI2R4GYTHZW5JXWS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp -j DNAT --to-destination 10.129.0.9:8080
Tue Jun 25 17:21:32 EDT 2019
Tue Jun 25 17:21:42 EDT 2019
Tue Jun 25 17:21:52 EDT 2019
```

=========================
Now, I scale killme back up to see whether hostports will be re-added. Note that ruby-ex-4-nhlf8 is still around, even though we no longer see its iptables entry.

```
$ oc get pod ; date
NAME              READY     STATUS              RESTARTS   AGE
killme-1-build    0/1       Completed           0          28m
killme-4-64tk9    0/1       ContainerCreating   0          4s
ruby-ex-4-nhlf8   1/1       Running             0          12m
Tue Jun 25 17:22:25 EDT 2019
```

Terminal 2 shows only the new killme pod's rules appearing; ruby-ex's 8888 rules are never restored:

```
Tue Jun 25 17:22:12 EDT 2019
Tue Jun 25 17:22:22 EDT 2019
-A KUBE-HOSTPORTS -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp --dport 9999 -j KUBE-HP-SG4IGRHXJSQS2BXY
-A KUBE-HP-SG4IGRHXJSQS2BXY -s 10.129.0.13/32 -m comment --comment "killme-4-64tk9_ruby hostport 9999" -j KUBE-MARK-MASQ
-A KUBE-HP-SG4IGRHXJSQS2BXY -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp -j DNAT --to-destination 10.129.0.13:8080
Tue Jun 25 17:22:32 EDT 2019
-A KUBE-HOSTPORTS -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp --dport 9999 -j KUBE-HP-SG4IGRHXJSQS2BXY
-A KUBE-HP-SG4IGRHXJSQS2BXY -s 10.129.0.13/32 -m comment --comment "killme-4-64tk9_ruby hostport 9999" -j KUBE-MARK-MASQ
-A KUBE-HP-SG4IGRHXJSQS2BXY -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp -j DNAT --to-destination 10.129.0.13:8080
Tue Jun 25 17:22:42 EDT 2019
```

=========================
Additional notes:

- I use a nodeSelector to make sure these pods always run on node-0:

```
$ oc get pod -o wide
NAME              READY     STATUS    RESTARTS   AGE       IP            NODE
killme-4-64tk9    1/1       Running   0          10m       10.129.0.13   node-0.datadyne.lab.pnq2.cee.redhat.com
ruby-ex-4-nhlf8   1/1       Running   0          23m       10.129.0.9    node-0.datadyne.lab.pnq2.cee.redhat.com
```

- My service accounts have:

```
$ oc describe scc hostnetwork
Name:      hostnetwork
Priority:  <none>
Access:
  Users:   system:serviceaccount:default:router,system:serviceaccount:default:registry,system:serviceaccount:ruby:default
. . .
$ oc get clusterrolebinding | grep router
. . .
router-router-role-0   /router-router-role   ruby/default
```

I think this is sufficient to show there might still be an issue. If you need help reproducing, want to re-reproduce it together, or want me to show it live, let me know; hopefully we should be able to do this again at will. :)

I used the default ruby example app, which you can get with:

```
$ oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
$ oc new-app --name=killme centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
```

I added a nodeSelector and updated the "ports" section of the pod definitions with these changes, respectively:

```
$ oc get dc -o yaml | grep hostPort -C1
        - containerPort: 8080
          hostPort: 9999
          protocol: TCP
--
        - containerPort: 8080
          hostPort: 8888
          protocol: TCP
```
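For reference, the resulting dc spec for ruby-ex would look roughly like the fragment below. This is a sketch, not a capture from the cluster; in particular the nodeSelector label and value are assumptions for illustration (substitute whatever label matches your node-0).

```yaml
# Sketch of the relevant ruby-ex DeploymentConfig fragment (abbreviated;
# the nodeSelector key/value is an assumption, not taken from the cluster)
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-0.datadyne.lab.pnq2.cee.redhat.com
      containers:
      - name: ruby-ex
        ports:
        - containerPort: 8080
          hostPort: 8888
          protocol: TCP
```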
Hi,

So I have been able to reproduce this on 3.11. The full sequence of events (as observed by me) is:

1) Y pods with hostPorts assigned are running on host X. iptables rules are fine and well-defined.
2) The SDN systemd service dies on host X. iptables rules remain correct once the SDN service is back up.
3) One pod dies and is re-spawned on host X. iptables rules are now inconsistent: all hostport rules are wiped, and only the newly re-spawned pod's rules are correct.

----

Conclusion: there definitely seems to be a link to the loss of the SDN process memory. This also only seems to affect pods with hostPorts assigned; I have run the same tests without hostPorts and the iptables rules remain consistent.

I will have a look and try to provide a fix ASAP.
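The behaviour in step 3 is consistent with a syncer that regenerates the whole hostport chain from its in-memory pod list. The following is a minimal hypothetical model of that failure mode (it is not the actual openshift-sdn code; the function, file, and rule format here are made up for illustration):

```shell
# Hypothetical model: the syncer rewrites the ENTIRE chain from the pods
# it knows about in memory. A restarted process that has only seen the
# re-spawned pod therefore drops every pre-existing rule on its next sync.
RULES=/tmp/kube-hostports.sim

sync_chain() {            # args: "pod:hostPort" entries the process knows
    : > "$RULES"          # full rewrite, not an incremental update
    for entry in "$@"; do
        echo "-A KUBE-HOSTPORTS --dport ${entry##*:} # ${entry%%:*}" >> "$RULES"
    done
}

# Before the restart the syncer knows both pods: both rules exist.
sync_chain ruby-ex-4-nhlf8:8888 killme-4-grgpx:9999
grep -c KUBE-HOSTPORTS "$RULES"        # prints: 2

# After a restart plus one pod re-spawn, it only knows the new pod,
# and the full rewrite silently discards ruby-ex's 8888 rule.
sync_chain killme-4-64tk9:9999
grep -q 8888 "$RULES" || echo "hostport 8888 rule gone"
```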
*** Bug 1550659 has been marked as a duplicate of this bug. ***
Any updates on this? It looks like the PR is still open, but as the customer has been affected by this for some time, I'm wondering if we can push it.
Hi,

The PRs on the parent branches have been merged and verified as fixing the issue (see the issue linked to this BZ). Final review of the 3.11 PR is ongoing. That PR is a bit bigger, as the backport to 3.11 required backporting other changes as well to get this fix working. We hope to be able to merge the PR by the end of the week.

Best regards,
Alexander
Hi,

Does this mean this will be fixed by errata https://errata.devel.redhat.com/advisory/47061, and will it form part of v3.11.152? Am I correct in guessing v3.11.152 hasn't been released yet?

Cheers,
David
Hi,

It's been over a week since this was QA'ed; just following up on my previous question in comment #29. Is this going to make it into v3.11.152? When will v3.11.152 be released? I am trying to manage the customer's expectations.

Thanks,
David
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3139
*** Bug 1744077 has been marked as a duplicate of this bug. ***