Bug 1956535
| Summary: | Multiple REDIRECT/DNAT iptables rules on the node causing "connection refused" error while accessing the idled pod | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Swadeep Asthana <swasthan> |
| Component: | Networking | Assignee: | Andrew Stoycos <astoycos> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED WORKSFORME | Severity: | high |
| Priority: | high | CC: | aconstan, amcdermo, anbhat, aojeagar, aos-bugs, arghosh, astoycos, bbennett, cdc, danw, emmanuel.quiroga, hgomes, mnunes |
| Version: | 4.7 | Keywords: | Reopened |
| Hardware: | Unspecified | OS: | Unspecified |
| Last Closed: | 2021-09-27 17:02:19 UTC | Type: | Bug |
| Bug Depends On: | 1990016 | | |
Comment 1 - Andrew McDermott - 2021-05-04 14:14:17 UTC
*** This bug has been marked as a duplicate of bug 1953705 ***

This is the correct bugzilla. Please keep it open for follow-up. This is not the same as 1953705.

It's not clear whether this is the same bug as 1953705 or not, so let's get the same info. It appears that if you idle a service with an "old" oc binary (4.6.16 or earlier, or most 4.7 alpha/beta builds) in a "new" cluster (4.6.17 or later, 4.7.0-rc.1 and later, or any 4.8 nightly), then it will not unidle correctly when it receives traffic. (openshift-sdn will emit the NeedPods event, but the controller will not scale it up.)

@swasthan, can you confirm the version of OCP you are using and the version of the "oc" binary that you are using to idle the pods? ("oc version" will tell you both.) If you are using an "old" oc binary, then getting an updated binary should fix the bug. If not, then please create a new deployment and service (i.e., one that has not been previously idled) and:

1. idle the service
2. get the output of "oc get service NAME -o yaml" and "oc get ep NAME -o yaml"
3. try to connect to the service
4. get the output of "oc get service NAME -o yaml" and "oc get ep NAME -o yaml" again
5. get the output of "oc get events -o yaml"
6. get the output of "oc get pods -n NAMESPACE" (to confirm whether pods have been recreated for the deployment)
7. tar/zip up all the files and attach them to this bug

Also, can you get a must-gather after reproducing the bug?

Okay, I've tracked this down to a bug in iptables. It is made more likely when there are more idled services, though it can occur regardless of the number of services.
For example:

```shell
# iptables -w 5 -W 100000 -C KUBE-PORTALS-HOST -t nat -m comment --comment foo/hello-43:hello1 -p tcp -m tcp --dport 80 -d 172.30.248.35/32 -j DNAT --to-destination 10.0.32.3:4685
```

This returns nothing, because the rule exists. However, if you run

```shell
# while true; do iptables -w 5 -W 100000 -S OPENSHIFT-SDN-CANARY -t mangle > /dev/null; done
```

in another terminal, the check quickly starts failing with `iptables: Bad rule (does a matching rule exist in that chain?)`. I'll file an upstream iptables-nft bug.
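As a rough sketch (not part of the bug report), the two-terminal reproduction above can be combined into one script. The chain names and the rule spec are copied from the comment; the 172.30.248.35 service IP and 10.0.32.3:4685 endpoint belong to the reporter's cluster, so on any other host the `-C` check fails for the unrelated reason that the rule does not exist. The 10-second duration and 100-iteration count are arbitrary choices; demonstrating the actual race requires root and the openshift-sdn chains.

```shell
#!/bin/sh
# Sketch of the race reproducer: hammer the mangle table in the background
# while repeatedly checking (-C) an existing NAT rule in the foreground.
if command -v iptables >/dev/null 2>&1 && [ "$(id -u)" = "0" ]; then
  # Background "terminal 2": list the canary chain in a tight loop for ~10s.
  ( end=$(( $(date +%s) + 10 ))
    while [ "$(date +%s)" -lt "$end" ]; do
      iptables -w 5 -W 100000 -S OPENSHIFT-SDN-CANARY -t mangle >/dev/null 2>&1 || true
    done ) &
  bgpid=$!
  # Foreground "terminal 1": repeatedly check the DNAT rule. With the race,
  # -C spuriously reports "Bad rule" even though the rule exists.
  errors=0
  for i in $(seq 1 100); do
    iptables -w 5 -W 100000 -C KUBE-PORTALS-HOST -t nat \
      -m comment --comment foo/hello-43:hello1 -p tcp -m tcp --dport 80 \
      -d 172.30.248.35/32 -j DNAT --to-destination 10.0.32.3:4685 \
      2>&1 | grep -q "Bad rule" && errors=$((errors + 1))
  done
  kill "$bgpid" 2>/dev/null || true
  wait "$bgpid" 2>/dev/null || true
  echo "Bad-rule errors seen: $errors"
  RESULT=ran
else
  echo "iptables/root not available; skipping reproducer"
  RESULT=skipped
fi
```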
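For completeness, the seven data-collection steps requested in the earlier comment can be bundled into one script. This is a hedged sketch: the service name, namespace, and output paths below are placeholders, not values from this bug. By default it only prints the oc commands it would run; set DRY_RUN=0 against a real, logged-in cluster to execute them.

```shell
#!/bin/sh
# Sketch only: SVC, NS, and OUT are hypothetical placeholders.
SVC="${SVC:-myservice}"
NS="${NS:-myproject}"
OUT="${OUT:-unidle-debug}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = actually run oc

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "oc $*"; else oc "$@"; fi
}

mkdir -p "$OUT"
run idle "$SVC" -n "$NS"                                          # step 1
run get service "$SVC" -n "$NS" -o yaml > "$OUT/svc-before.yaml"  # step 2
run get ep "$SVC" -n "$NS" -o yaml > "$OUT/ep-before.yaml"
# step 3: try to connect to the service (placeholder URL, uncomment to use)
# curl --max-time 5 "http://$SVC.$NS.svc" || true
run get service "$SVC" -n "$NS" -o yaml > "$OUT/svc-after.yaml"   # step 4
run get ep "$SVC" -n "$NS" -o yaml > "$OUT/ep-after.yaml"
run get events -n "$NS" -o yaml > "$OUT/events.yaml"              # step 5
run get pods -n "$NS" > "$OUT/pods.txt"                           # step 6
tar czf "$OUT.tar.gz" "$OUT"                                      # step 7
echo "attach $OUT.tar.gz to the bug"
```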