Bug 1903206
Summary: | Ingress controller incorrectly routes traffic to non-ready pods/backends.
---|---
Product: | OpenShift Container Platform
Component: | Networking
Networking sub component: | router
Status: | CLOSED ERRATA
Severity: | urgent
Priority: | urgent
Version: | 4.6
Target Release: | 4.7.0
Hardware: | All
OS: | Unspecified
Reporter: | emahoney
Assignee: | Andrew McDermott <amcdermo>
QA Contact: | Arvind Iyengar <aiyengar>
CC: | aiyengar, amcdermo, aos-bugs, aos-network-edge-staff, arthur.barr, chris.wilkinson, cmarches, hongli, mmasters, openshift-bugs-escalate, rcarrier
Type: | Bug
Last Closed: | 2021-02-24 15:37:08 UTC
Bug Blocks: | 1904010
Description
emahoney
2020-12-01 16:10:19 UTC
Hello Team, could you please prioritize this Bugzilla? The customer has made its high impact clear to us, and the case has high visibility with management on both sides. Thanks in advance for your efforts and support.

Kind regards,
Roberto Carrieri
Escalation Manager, Customer Experience & Engagement
Mobile: +420.702.269.469

---

Will look into this immediately.

---

Is there any update on this issue, please? Have you managed to reproduce it? Very happy to perform additional diagnostics, but hopefully you can re-create it based on the above.

I'm a little concerned that the "Target Release" has been set to 4.7.0, as we really need to see a fix on OCP 4.6.x: this appears to be regressed/changed behaviour.

---

I can reproduce this. This was broken by the switch to EndpointSlices (https://github.com/openshift/router/pull/154), which happened in 4.6. Investigating a fix, which will then be backported to 4.6.
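For context, the EndpointSlice API that the router switched to expresses per-endpoint readiness as a condition on each endpoint. The following is a hedged illustration, not an object dumped from this cluster (the metadata name is hypothetical; addresses and service name are borrowed from the verification steps below): a slice for a service with one ready and one non-ready pod. The fix amounts to turning only endpoints with `conditions.ready: true` into haproxy backend servers.

```yaml
# Illustrative EndpointSlice (assumed shape; discovery.k8s.io/v1beta1 is the
# version served in the OCP 4.6/4.7 timeframe).
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: nginxdemoshello-service-abc12   # hypothetical generated name
  labels:
    kubernetes.io/service-name: nginxdemoshello-service
addressType: IPv4
ports:
  - name: http
    port: 8080
    protocol: TCP
endpoints:
  - addresses:
      - 10.131.0.49
    conditions:
      ready: true     # eligible for a haproxy "server" line
  - addresses:
      - 10.129.2.32
    conditions:
      ready: false    # must be excluded from the backend pool
```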
---

(In reply to Arthur Barr from comment #3)
> Is there any update on this issue, please? Have you managed to reproduce
> it? Very happy to perform additional diagnostics, but hopefully you can
> re-create based on the above.
>
> I'm a little concerned that the "Target Release" has been set to 4.7.0, as
> we really need to see a fix on OCP 4.6.x, as this appears to be
> regressed/changed behaviour.

The procedure means that we first make the fix in 4.7 and then backport it to 4.6. I plan to have a fix up for review for 4.7 today.

---

Thanks very much for the update. Assuming this fix is accepted for 4.7, can you give any indication of a timeline for a fix on 4.6? Any information would be appreciated.

---

(In reply to Arthur Barr from comment #6)
> Thanks very much for the update. Assuming this fix is accepted for 4.7, can
> you give any indication of a timeline for a fix on 4.6? Any information
> would be appreciated.

I just POSTed the PR: https://github.com/openshift/router/pull/229

If this gets reviewed and merged into 4.7 today, then I can start the cherry-pick for 4.6. Once picked for 4.6, it needs approval for a 4.6.z stream, which may happen tomorrow; failing that, it would be the end of next week. Once it is merged into 4.7 I can give a better estimate.

---

Moving this back to POST, as https://github.com/openshift/router/pull/231 needs to be part of the overall change. I was adding tests to origin/e2e to verify the change, but that is overkill; PR #231 adds unit tests to the router instead.

---

The PR was merged into the "4.7.0-0.nightly-2020-12-03-141554" release payload. With this payload, the fix effectively resolves the problem: when no pods are in the "Ready" state, the haproxy configuration has an empty backend pool, and curl to the external route fails as expected. When one or more pods are available and ready, the haproxy backend pool is populated with entries for the ready pods only, and external route traffic is sent only to those pods.

* With no pods in the "ready" state:

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-03-141554   True        False         64m     Cluster version is 4.7.0-0.nightly-2020-12-03-141554

$ oc create -f nginx-demoshell-deployment.yaml
deployment.apps/nginxdemoshello-deployment created
$ oc create -f nginx-demoshell-service.yaml
service/nginxdemoshello-service created
$ oc create -f nginx-demoshell-route.yaml
route.route.openshift.io/nginxdemoshello-route created

$ oc get pods -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
nginxdemoshello-deployment-5b46f96478-gwlc7   0/1     Running   0          41s   10.131.0.49   ip-10-0-164-195.us-east-2.compute.internal   <none>           <none>
nginxdemoshello-deployment-5b46f96478-n6wlm   0/1     Running   0          41s   10.129.2.32   ip-10-0-199-23.us-east-2.compute.internal    <none>           <none>
```

The haproxy configuration contains no server entries for the backend; the non-ready pods are no longer added to the pool:

```
backend be_http:test1:nginxdemoshello-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout check 5000ms
  http-request add-header X-Forwarded-Host %[req.hdr(host)]
  http-request add-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  http-request add-header X-Forwarded-Proto-Version h2 if { ssl_fc_alpn -i h2 }
  http-request add-header Forwarded for=%[src];host=%[req.hdr(host)];proto=%[req.hdr(X-Forwarded-Proto)]
  cookie 1384d216b7b1811db4625b94ff95ea56 insert indirect nocache httponly
```

The route is unreachable at this time, with no backend pods in the "ready" state:

```
$ curl nginxdemoshello-drb-test1.apps.aiyengar-oc47-1903206.qe.devcluster.openshift.com -I
HTTP/1.0 503 Service Unavailable
Pragma: no-cache
Cache-Control: private, max-age=0, no-cache, no-store
Connection: close
Content-Type: text/html
```
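The next step flips one pod to Ready by creating /tmp/ready inside it, which implies the demo deployment uses an exec readiness probe keyed on that file. The manifest itself is not attached to this bug, so the container-spec snippet below is an assumed sketch (the image reference and probe period are hypothetical), shown only to explain why the `touch` below changes the pod's readiness:

```yaml
# Assumed readiness probe for the nginxdemoshello deployment (not the actual
# manifest from this bug): the probe succeeds only once /tmp/ready exists,
# so the pod stays 0/1 until the file is created by hand.
containers:
  - name: nginxdemoshello
    image: quay.io/example/nginx-demoshell:latest   # hypothetical image reference
    ports:
      - containerPort: 8080
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]
      periodSeconds: 5
```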
* With one pod set to the ready state:

```
$ oc exec nginxdemoshello-deployment-5b46f96478-gwlc7 -- touch /tmp/ready

$ oc get pods -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
nginxdemoshello-deployment-5b46f96478-gwlc7   1/1     Running   0          16m   10.131.0.49   ip-10-0-164-195.us-east-2.compute.internal   <none>           <none>
nginxdemoshello-deployment-5b46f96478-n6wlm   0/1     Running   0          16m   10.129.2.32   ip-10-0-199-23.us-east-2.compute.internal    <none>           <none>
```

A server entry is added to the haproxy backend config:

```
$ oc -n openshift-ingress exec router-default-6458cc5549-hfh6z -- grep -i "nginxdemoshello-route" haproxy.config -A15
backend be_http:test1:nginxdemoshello-route
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout check 5000ms
  http-request add-header X-Forwarded-Host %[req.hdr(host)]
  http-request add-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  http-request add-header X-Forwarded-Proto-Version h2 if { ssl_fc_alpn -i h2 }
  http-request add-header Forwarded for=%[src];host=%[req.hdr(host)];proto=%[req.hdr(X-Forwarded-Proto)]
  cookie 1384d216b7b1811db4625b94ff95ea56 insert indirect nocache httponly
  server pod:nginxdemoshello-deployment-5b46f96478-gwlc7:nginxdemoshello-service::10.131.0.49:8080 10.131.0.49:8080 cookie 1ecb3172ae59bdbf44242ac3d7873732 weight 256
```

The curl traffic now hits the "Ready" pod only:

```
$ curl nginxdemoshello-drb-test1.apps.aiyengar-oc47-1903206.qe.devcluster.openshift.com -I
HTTP/1.1 200 OK
Server: nginx/1.16.1
Date: Fri, 04 Dec 2020 08:47:00 GMT
Content-Type: text/plain
Content-Length: 175
Expires: Fri, 04 Dec 2020 08:46:59 GMT
Cache-Control: no-cache
Set-Cookie: 1384d216b7b1811db4625b94ff95ea56=1ecb3172ae59bdbf44242ac3d7873732; path=/; HttpOnly

$ curl nginxdemoshello-drb-test1.apps.aiyengar-oc47-1903206.qe.devcluster.openshift.com
Server address: 10.131.0.49:8080
Server name: nginxdemoshello-deployment-5b46f96478-gwlc7
Date: 04/Dec/2020:08:47:02 +0000
URI: /
Request ID: f05bee9b4cf1bab32facfb9937e9e602

$ curl nginxdemoshello-drb-test1.apps.aiyengar-oc47-1903206.qe.devcluster.openshift.com
Server address: 10.131.0.49:8080
Server name: nginxdemoshello-deployment-5b46f96478-gwlc7
Date: 04/Dec/2020:08:47:04 +0000
URI: /
Request ID: 5c6816d9f46ae794e8ea5017a122843e
```

---

Now that this is merged to 4.7, are you able to indicate which 4.6.x update this fix will be targeted at?

---

(In reply to chris.wilkinson.com from comment #12)
> Now that this is merged to 4.7, are you able to indicate which 4.6.x update
> this fix will be targeted at?

4.6.8. Currently waiting for the following PRs to merge in the 4.6 release branch:

- https://github.com/openshift/router/pull/230
- https://github.com/openshift/router/pull/232

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633