Bug 1723924
| Summary: | Unexpected loss of pod hostports | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
| Component: | Networking | Assignee: | Alexander Constantinescu <aconstan> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, bbennett, cdc, dsquirre, dzhukous, huirwang, mfuruta, nbhatt, rhowe, rvokal |
| Version: | 3.7.1 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1740731 1740741 | Environment: | |
| Last Closed: | 2019-10-18 01:34:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1740741 | | |
| Bug Blocks: | 1698629 | | |
Description
Steven Walter
2019-06-25 18:46:03 UTC
I think I reproduced it.
Two applications:
ruby-ex, with hostPort 8888
killme, with hostPort 9999
The goal was, at first, to kill the "killme" pods to see if losing their hostport would inadvertently affect the ruby-ex port. I found that this did not happen.
However, restarting the atomic-openshift-node service DOES result in the ruby-ex pod losing its hostport.
Here's my data.
Terminal 1 shows my work. In terminal 2, I have a loop checking iptables-save for rules around ports 8888 and 9999:
# while true; do iptables-save | grep -e 8888 -e 9999 ; date; sleep 10; done
=========================
//Scale down
[quicklab@master-0 ~]$ oc get pod
NAME READY STATUS RESTARTS AGE
killme-4-grgpx 1/1 Running 0 21s
ruby-ex-4-nhlf8 1/1 Running 0 6m
[quicklab@master-0 ~]$ oc scale dc killme --replicas=0
deploymentconfig "killme" scaled
[quicklab@master-0 ~]$ date
Tue Jun 25 17:18:50 EDT 2019
We see port 9999 disappear after the scaledown, as expected. Port 8888 remains.
-A KUBE-HOSTPORTS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp --dport 8888 -j KUBE-HP-QI2R4GYTHZW5JXWS
-A KUBE-HOSTPORTS -p tcp -m comment --comment "killme-4-grgpx_ruby hostport 9999" -m tcp --dport 9999 -j KUBE-HP-3LMI74HSZVOC4LTK
-A KUBE-HP-3LMI74HSZVOC4LTK -s 10.129.0.11/32 -m comment --comment "killme-4-grgpx_ruby hostport 9999" -j KUBE-MARK-MASQ
-A KUBE-HP-3LMI74HSZVOC4LTK -p tcp -m comment --comment "killme-4-grgpx_ruby hostport 9999" -m tcp -j DNAT --to-destination 10.129.0.11:8080
-A KUBE-HP-QI2R4GYTHZW5JXWS -s 10.129.0.9/32 -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -j KUBE-MARK-MASQ
-A KUBE-HP-QI2R4GYTHZW5JXWS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp -j DNAT --to-destination 10.129.0.9:8080
Tue Jun 25 17:18:42 EDT 2019
-A KUBE-HOSTPORTS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp --dport 8888 -j KUBE-HP-QI2R4GYTHZW5JXWS
-A KUBE-HP-QI2R4GYTHZW5JXWS -s 10.129.0.9/32 -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -j KUBE-MARK-MASQ
-A KUBE-HP-QI2R4GYTHZW5JXWS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp -j DNAT --to-destination 10.129.0.9:8080
Tue Jun 25 17:18:52 EDT 2019
=========================
Now, we'll restart the atomic-openshift-node service and check.
[quicklab@node-0 ~]$ sudo systemctl restart atomic-openshift-node
[quicklab@node-0 ~]$ date
Tue Jun 25 17:21:48 EDT 2019
[quicklab@node-0 ~]$ sudo systemctl status atomic-openshift-node
● atomic-openshift-node.service - OpenShift Node
Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
└─openshift-sdn-ovs.conf
Active: active (running) since Tue 2019-06-25 17:21:35 EDT; 29s ago
-A KUBE-HOSTPORTS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp --dport 8888 -j KUBE-HP-QI2R4GYTHZW5JXWS
-A KUBE-HP-QI2R4GYTHZW5JXWS -s 10.129.0.9/32 -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -j KUBE-MARK-MASQ
-A KUBE-HP-QI2R4GYTHZW5JXWS -p tcp -m comment --comment "ruby-ex-4-nhlf8_ruby hostport 8888" -m tcp -j DNAT --to-destination 10.129.0.9:8080
Tue Jun 25 17:21:32 EDT 2019
Tue Jun 25 17:21:42 EDT 2019
Tue Jun 25 17:21:52 EDT 2019
=========================
Now, I scale up killme to see if hostports will be re-added. Note that ruby-ex-4-nhlf8 is still around, even though we no longer see its iptables entry.
$ oc get pod ; date
NAME READY STATUS RESTARTS AGE
killme-1-build 0/1 Completed 0 28m
killme-4-64tk9 0/1 ContainerCreating 0 4s
ruby-ex-4-nhlf8 1/1 Running 0 12m
Tue Jun 25 17:22:25 EDT 2019
Tue Jun 25 17:22:12 EDT 2019
Tue Jun 25 17:22:22 EDT 2019
-A KUBE-HOSTPORTS -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp --dport 9999 -j KUBE-HP-SG4IGRHXJSQS2BXY
-A KUBE-HP-SG4IGRHXJSQS2BXY -s 10.129.0.13/32 -m comment --comment "killme-4-64tk9_ruby hostport 9999" -j KUBE-MARK-MASQ
-A KUBE-HP-SG4IGRHXJSQS2BXY -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp -j DNAT --to-destination 10.129.0.13:8080
Tue Jun 25 17:22:32 EDT 2019
-A KUBE-HOSTPORTS -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp --dport 9999 -j KUBE-HP-SG4IGRHXJSQS2BXY
-A KUBE-HP-SG4IGRHXJSQS2BXY -s 10.129.0.13/32 -m comment --comment "killme-4-64tk9_ruby hostport 9999" -j KUBE-MARK-MASQ
-A KUBE-HP-SG4IGRHXJSQS2BXY -p tcp -m comment --comment "killme-4-64tk9_ruby hostport 9999" -m tcp -j DNAT --to-destination 10.129.0.13:8080
Tue Jun 25 17:22:42 EDT 2019
=========================
Additional notes:
- I use a nodeSelector to make sure these pods always run on node-0 (a sketch of one way to set this follows the output below):
$ oc get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE
killme-4-64tk9 1/1 Running 0 10m 10.129.0.13 node-0.datadyne.lab.pnq2.cee.redhat.com
ruby-ex-4-nhlf8 1/1 Running 0 23m 10.129.0.9 node-0.datadyne.lab.pnq2.cee.redhat.com
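For reference, a nodeSelector like this can be added to both deployment configs with something along these lines (a sketch only; the exact method used isn't shown in this report, and it assumes the node's kubernetes.io/hostname label matches the node name shown above):
$ oc patch dc/ruby-ex --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"node-0.datadyne.lab.pnq2.cee.redhat.com"}}}}}'
$ oc patch dc/killme --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"node-0.datadyne.lab.pnq2.cee.redhat.com"}}}}}'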
My service accounts have access to the hostnetwork SCC (one way to grant this is sketched after the output below):
$ oc describe scc hostnetwork
Name: hostnetwork
Priority: <none>
Access:
Users: system:serviceaccount:default:router,system:serviceaccount:default:registry,system:serviceaccount:ruby:default
. . .
$ oc get clusterrolebinding | grep router
. . .
router-router-role-0 /router-router-role ruby/default
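For completeness, this kind of SCC access is typically granted with add-scc-to-user (a sketch only; how it was actually granted in this cluster isn't shown here):
$ oc adm policy add-scc-to-user hostnetwork -z default -n ruby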
I think this is sufficient to show there might still be an issue. If you need help reproducing, want to re-reproduce it together, or want me to show it live, let me know -- hopefully we should be able to do this again at will. :)
I used the default ruby example app which you can get with:
$ oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
$ oc new-app --name=killme centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
I added a nodeSelector and set hostPort in the "ports" section of each deployment config (9999 for killme, 8888 for ruby-ex); one way to apply the hostPort change is sketched after the snippet below:
$ oc get dc -o yaml | grep hostPort -C1
- containerPort: 8080
hostPort: 9999
protocol: TCP
--
- containerPort: 8080
hostPort: 8888
protocol: TCP
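One way to apply the hostPort change is a JSON patch against each deployment config (a sketch only; it assumes the container and its port entry are the first ones, i.e. index 0, in each DC):
$ oc patch dc/killme --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/ports/0/hostPort","value":9999}]'
$ oc patch dc/ruby-ex --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/ports/0/hostPort","value":8888}]'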
Hi,

So I have been able to reproduce this on 3.11. The full sequence of events (as observed by me) is:

1) Y pods with hostPorts assigned are running on host X. iptables rules are fine and well-defined.
2) The SDN systemd service dies on host X. iptables rules remain correct once the SDN service is back up.
3) One pod dies and is re-spawned on host X. iptables rules are now inconsistent: all iptables rules are wiped and only the newly re-spawned pod's rules are correct.

----

Conclusion: there definitely seems to be a link to the loss of the SDN process memory. This also only seems to affect pods with hostPorts assigned; I have run the same tests without hostPorts and the iptables rules remain consistent.

I will have a look and try to provide a fix ASAP.

*** Bug 1550659 has been marked as a duplicate of this bug. ***

Any updates on this? It looks like the PR is still open, but as the customer has been affected by this for some time I'm wondering if we can push it.

Hi,

The PRs on the parent branches have been merged and verified as fixing the issue (see the linked issue on this BZ). Final review of the 3.11 PR is ongoing. That PR is a bit bigger, as the back-port to 3.11 required back-porting other things as well to get this fix working. We hope to be able to merge the PR by the end of the week.

Best regards,
Alexander

Hi,

Does this mean this will be fixed by errata https://errata.devel.redhat.com/advisory/47061, and will it form part of v3.11.152? Am I correct in guessing v3.11.152 hasn't been released yet?

Cheers,
David

Hi,

It's been over a week since this was QA'd; just following up on my previous question in #29. Is this going to make it into v3.11.152? When will v3.11.152 be released? I am trying to manage the customer's expectations.

Thanks,
David

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3139

*** Bug 1744077 has been marked as a duplicate of this bug. ***
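For anyone verifying the fix after applying the errata, a minimal re-check along the lines of the original reproduction (a sketch only; it assumes the same ruby-ex/killme deployments, hostPorts 8888/9999, and node-0 as above) would be:
# On the master: make sure both hostPort pods are running on node-0
$ oc get pod -o wide
# On node-0: confirm both hostport rules exist, then restart the node service
$ sudo iptables-save | grep -e 'hostport 8888' -e 'hostport 9999'
$ sudo systemctl restart atomic-openshift-node
# Back on the master: churn one of the pods so the hostport rules get re-synced
$ oc scale dc killme --replicas=0
$ oc scale dc killme --replicas=1
# On node-0: with the fix in place, the ruby-ex rule for hostport 8888 should still be present
$ sudo iptables-save | grep 'hostport 8888'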