Bug 1752646

Summary:	Conformance "should set Forwarded headers appropriately" router test is disabled
Product:	OpenShift Container Platform	Reporter:	Dan Mace <dmace>
Component:	Networking	Assignee:	Andrew McDermott <amcdermo>
Networking sub component:	router	QA Contact:	Mike Fiedler <mifiedle>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	high	CC:	amcdermo, anli, aos-bugs, ccoleman, danw, egarcia, m.andre, mifiedle
Version:	4.2.0
Target Milestone:	---
Target Release:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-23 11:06:16 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dan Mace 2019-09-16 19:43:02 UTC

Description of problem:

The Conformance "should set Forwarded headers appropriately" router test is disabled because of some un-diagnosed issue in 4.x. Either fix the test and re-enable it, or propose deleting the test.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 egarcia 2019-09-17 14:36:37 UTC

From the original skip comment:

// check who can update TemplateInstances.  Via Update(), spec updates
// should be rejected (with the exception of spec.metadata fields used
// by the garbage collector, not tested here).  Status updates should be
// silently ignored.  Via UpdateStatus(), spec updates should be
// silently ignored.  Status should only be updatable by a user with
// update access to that endpoint.  In practice this is intended only to
// be the templateinstance controller and system:admin.

Comment 2 egarcia 2019-09-17 14:48:01 UTC

Moving the test to broken skip tag in test/extended/util/test: https://github.com/openshift/origin/pull/23814

Comment 3 Miciah Dashiel Butler Masters 2019-09-17 21:25:34 UTC

*** Bug 1748764 has been marked as a duplicate of this bug. ***

Comment 4 Andrew McDermott 2019-10-15 17:54:36 UTC

This test is fine on both GCP and Azure - the routing must be different on AWS. Do we want to enable this test again, but with an explicit skip for AWS?

Comment 5 Dan Mace 2019-10-15 18:51:49 UTC

(In reply to Andrew McDermott from comment #4)
> This test is fine on both GCP and Azure - the routing must be different on
> AWS. Do we want to enable this test again, but with an explicit skip for AWS?

I think for me to have an opinion here I would need a root cause analysis of the failure on AWS.

Comment 6 Andrew McDermott 2019-10-16 21:36:07 UTC

(In reply to Dan Mace from comment #5)
> (In reply to Andrew McDermott from comment #4)
> > This test is fine on both GCP and Azure - the routing must be different on
> > AWS. Do we want to enable this test again, but with an explicit skip for AWS?
> 
> I think for me to have an opinion here I would need a root cause analysis of
> the failure on AWS.

TL;DR - on AWS the route to the router/loadbalancer seems to go out of the cluster (and back again) whereas on GCP and Azure it doesn't yielding the origin IP address to be the POD's IP address and that is what the test asserts for.

###
### GCP
###

On GCP I see the following:

KUBECONFIG=~/amcdermo-gcp-kubeconfig oc get service -n openshift-ingress -o yaml router-default

status:
  loadBalancer:
    ingress:
    - ip: 35.223.10.172

# traceroute 35.223.10.172
traceroute to 35.223.10.172 (35.223.10.172), 30 hops max, 60 byte packets
 1  172.10.223.35.bc.googleusercontent.com (35.223.10.172)  0.073 ms  0.027 ms  0.019 ms

# ip route get 35.223.10.172
local 35.223.10.172 dev lo src 35.223.10.172 uid 0 
    cache <local> 

The test runs like this:

STEP: making a request and reading back the echoed headers
Oct 16 20:53:39.932: INFO: Running '/usr/bin/kubectl --server=https://api.amcdermo.gcp.devcluster.openshift.com:6443 --kubeconfig=/home/
aim/go-projects/openshift-installer/src/github.com/openshift/installer/amcdermo-gcp/auth/kubeconfig exec --namespace=e2e-test-router-hea
ders-n7s5n execpod -- /bin/sh -x -c 
                set -e
                payload=$( curl -s --header 'Host: router-headers.example.com' "http://35.223.10.172" ) || rc=$?
                if [[ "${rc:-0}" -eq 0 ]]; then
                        printf "${payload}"
                        exit 0
                else
                        echo "error ${rc}" 1>&2
                        exit 1
                fi
                '

I changed the e2e test to sleep forever so that I could rsh to the execpod and run the test manually:

$ oc get pods --all-namespaces -o wide  |grep test-
e2e-test-router-headers-n7s5n                           execpod                                                                    1/1       Running     0          75m       10.131.0.33   amcder-s82x6-w-a-lg9jj.c.openshift-gce-devel.internal   <none>           <none>
e2e-test-router-headers-n7s5n                           router-http-echo-1-deploy                                                  0/1       Completed   0          75m       10.131.0.34   amcder-s82x6-w-a-lg9jj.c.openshift-gce-devel.internal   <none>           <none>
e2e-test-router-headers-n7s5n                           router-http-echo-1-l5vq5                                                   1/1       Running     0          75m       10.129.2.13   amcder-s82x6-w-c-zxhbz.c.openshift-gce-devel.internal   <none>           <none>

$ oc rsh -n e2e-test-router-headers-n7s5n execpod
/ # curl -s --header "Host: router-headers.example.com" http://35.223.10.172
GET / HTTP/1.1
User-Agent: curl/7.61.1
Accept: */*
Host: router-headers.example.com
X-Forwarded-Host: router-headers.example.com
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Forwarded: for=10.131.0.33;host=router-headers.example.com;proto=http;proto-version=""
X-Forwarded-For: 10.131.0.33

and "10.131.0.33" is the expected result as it matches the POD IP above.

###
### AWS
###

And doing this on AWS I see:

KUBECONFIG=~/amcdermo-aws-kubeconfig oc get service -n openshift-ingress -o yaml router-default

status:
  loadBalancer:
    ingress:
    - hostname: aec9b866ef7ea435397684e196095c95-245284306.us-east-1.elb.amazonaws.com

# dig +short aec9b866ef7ea435397684e196095c95-245284306.us-east-1.elb.amazonaws.com
18.213.226.39

# ip r 
default via 10.0.128.1 dev ens3 
default via 10.0.128.1 dev ens3 proto dhcp metric 100 
10.0.128.0/20 dev ens3 proto kernel scope link src 10.0.131.234 
10.0.128.0/20 dev ens3 proto kernel scope link src 10.0.131.234 metric 100 
10.128.0.0/14 dev tun0 scope link 
172.30.0.0/16 dev tun0 

# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.128.1      0.0.0.0         UG    0      0        0 ens3
0.0.0.0         10.0.128.1      0.0.0.0         UG    100    0        0 ens3
10.0.128.0      0.0.0.0         255.255.240.0   U     0      0        0 ens3
10.0.128.0      0.0.0.0         255.255.240.0   U     100    0        0 ens3
10.128.0.0      0.0.0.0         255.252.0.0     U     0      0        0 tun0
172.30.0.0      0.0.0.0         255.255.0.0     U     0      0        0 tun0

# ip r get 18.213.226.39
18.213.226.39 via 10.0.128.1 dev ens3 src 10.0.131.234 uid 0 

# traceroute to 18.213.226.39 (18.213.226.39), 30 hops max, 60 byte packets
 1  ip-10-0-11-92.ec2.internal (10.0.11.92)  0.701 ms  0.641 ms  0.601 ms
 2  216.182.231.106 (216.182.231.106)  13.045 ms 216.182.229.82 (216.182.229.82)  5.698 ms 216.182.226.20 (216.182.226.20)  16.894 ms

So access to the router goes outside of the cluster and back in again.

Whois 216.182.226.20?

NetRange:       216.182.224.0 - 216.182.239.255
CIDR:           216.182.224.0/20
NetName:        AMAZON-AES

So given that the test effectively runs as:

# curl -s --header "Host: router-headers.example.com" "http://aec9b866ef7ea435397684e196095c95-245284306.us-east-1.elb.amazonaws.com

it doesn't matter where we run it from, because:

# oc rsh netshoot
# curl -s --header "Host: router-headers.example.com" "http://aec9b866ef7ea435397684e196095c95-245284306.us-east-1.elb.amazonaws.com"
GET / HTTP/1.1
User-Agent: curl/7.65.1
Accept: */*
Host: router-headers.example.com
X-Forwarded-Host: router-headers.example.com
X-Forwarded-Port: 80
X-Forwarded-Proto: http
Forwarded: for=35.170.15.97;host=router-headers.example.com;proto=http;proto-version=""
X-Forwarded-For: 35.170.15.97

# dig -x 35.170.15.97
;; ANSWER SECTION:
97.15.170.35.in-addr.arpa. 60	IN	PTR	ec2-35-170-15-97.compute-1.amazonaws.com.

And 35.170.15.97 is an elastic-ip ("amcdermo-bswxd-eip-us-east-1c").

###
### Conclusion
### 

From a testing point of view should this test assert that the origin in the X-Forward-For header is actually the PODs IP address or simply that there is an X-Forward-For header (and others) without considering its value? Or is the routing either borked or works as expected on AWS?

Comment 7 Andrew McDermott 2019-10-16 21:59:03 UTC

Also see: https://github.com/openshift/origin/pull/23688/commits/769f92152edd6e002902d9fc7e0c33960040d0c7

Comment 8 Andrew McDermott 2019-10-16 22:16:47 UTC

And also see: https://github.com/kubernetes/kubernetes/issues/57250

Comment 9 Andrew McDermott 2019-11-05 15:51:02 UTC

https://github.com/openshift/origin/pull/24093

Comment 11 Martin André 2019-11-13 16:57:18 UTC

OpenStack CI is now failing the HAProxy router header test after https://github.com/openshift/origin/pull/24093 re-enabled it.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/400/build-log.txt

Comment 12 Andrew McDermott 2019-11-13 18:43:26 UTC

(In reply to Martin André from comment #11)
> OpenStack CI is now failing the HAProxy router header test after
> https://github.com/openshift/origin/pull/24093 re-enabled it.
> 
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-
> installer-e2e-openstack-4.3/400/build-log.txt

See also: https://bugzilla.redhat.com/show_bug.cgi?id=1772125

Comment 13 Mike Fiedler 2019-11-16 15:37:28 UTC

Verified running openshift/conformance/parallel on an AWS cluster:

started: (15/1659/2119) "[Conformance][Area:Networking][Feature:Router] The HAProxy router should set Forwarded headers appropriately [Suite:openshift/conformance/parallel/minimal]"
passed: (1m3s) 2019-11-16T00:08:49 "[Conformance][Area:Networking][Feature:Router] The HAProxy router should set Forwarded headers appropriately [Suite:openshift/conformance/parallel/minimal]"


4.3.0-0.nightly-2019-11-13-233341

Comment 14 Clayton Coleman 2019-12-04 03:41:01 UTC

I am extremely concerned that this test is not verifying that source ip is preserved.  That's a key feature and behavior of the router (it's why we enabled proxy protocol) and it should work on AWS.

We need to find a way to consistently test the feature as part of this e2e test.

Comment 16 errata-xmlrpc 2020-01-23 11:06:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 17 Dan Winship 2020-04-09 21:09:52 UTC

(In reply to Clayton Coleman from comment #14)
> I am extremely concerned that this test is not verifying that source ip is
> preserved.  That's a key feature and behavior of the router (it's why we
> enabled proxy protocol) and it should work on AWS.
> 
> We need to find a way to consistently test the feature as part of this e2e
> test.

The test case as written is checking that if a pod within the cluster connects to a service within the cluster via the router, then the client IP is preserved. I assume that what we actually care about is what happens when a client *outside the cluster* connects to the router? This test is *not* testing that.

In the GCP/Azure case, when the CloudProvider sets a loadbalancer on a service, it updates the service.Status.LoadBalancer to have the load balancer's IP. Kube-proxy sees this and then creates a short-circuit rule redirecting traffic directly from the load balancer IP to the service IP. So when a pod in a GCP or Azure cluster tries to connect to a cloud loadbalancer, as in this test case, the connection never actually goes to the cloud load balancer; it is intercepted by iptables rules on the node and forwarded directly to the service endpoint. So you are not testing the client-IP-preserving behavior of the load balancer, you're testing the client-IP-preserving behavior of kube-proxy.

With AWS, the CloudProvider does not set the loadbalancer IP on service.Status.LoadBalancer, so kube-proxy can't create the short-circuit rules, and so when a pod tries to connect to a load balancer, the connection leaves the node, gets NATted to the node IP, hits the cloud load balancer, and gets redirected back to the cluster from there. Assuming that the cloud load balancer preserves the client IP correctly, the destination pod should see X-Forwarded-For the node IP of the execpod rather than its pod IP.


It would be better to have the test connect to the router directly from the openshift-tests process, rather that connecting from a pod inside the cluster being tested. Then you'd always be hitting the cloud load balancer. The problem with this would be figuring out what client IP is actually expected in that case...

Comment 18 Clayton Coleman 2020-04-09 21:14:55 UTC

Yeah, this would be like the upgrade availability tests.  We probably need a rule that if the test process can't access that, the test is meaningless and should be excluded from whatever weird environment that is (private clusters need to be tested via a tunnel anyway).