Bug 1812960 - e2e-azure-ovn failing consistently: The HAProxy router should set Forwarded headers appropriately
Keywords:
Status: CLOSED DUPLICATE of bug 1802311
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Phil Cameron
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1788309
Blocks: 1802311
Reported: 2020-03-12 15:14 UTC by Corey Daley
Modified: 2022-03-21 10:49 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1788309
Environment:
Last Closed: 2020-05-20 13:34:57 UTC
Target Upstream Version:
Embargoed:
cdaley: needinfo-



Description Corey Daley 2020-03-12 15:14:10 UTC
[Conformance][Area:Networking][Feature:Router] The HAProxy router [Top Level] [Conformance][Area:Networking][Feature:Router] The HAProxy router should set Forwarded headers appropriately [Suite:openshift/conformance/parallel/minimal] (1m38s)
fail [github.com/openshift/origin/test/extended/router/headers.go:183]: Mar 12 09:21:01.087: Unexpected header: '100.64.4.1' (expected 10.128.2.12); All headers: http.Header{"Accept":[]string{"*/*"}, "Forwarded":[]string{"for=100.64.4.1;host=router-headers.example.com;proto=http;proto-version=\"\""}, "User-Agent":[]string{"curl/7.61.1"}, "X-Forwarded-For":[]string{"100.64.4.1"}, "X-Forwarded-Host":[]string{"router-headers.example.com"}, "X-Forwarded-Port":[]string{"80"}, "X-Forwarded-Proto":[]string{"http"}}

Example failure:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.4/1190
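
To make the failure message above easier to read, here is a minimal Python sketch that splits the reported `Forwarded` value into its parameters and compares `for=` with the address the test expected. The helper name `parse_forwarded` is hypothetical and is not the code in headers.go. It assumes a single header element with no `;` inside quoted values, which holds for the value in this report.

```python
def parse_forwarded(value):
    """Split one Forwarded header element into a dict of its parameters.

    Simplification (assumption): one element only, and no ';' inside
    quoted strings. Both hold for the header reported in this bug.
    """
    params = {}
    for pair in value.split(";"):
        name, _, raw = pair.strip().partition("=")
        # Values may be quoted strings, e.g. proto-version="".
        params[name.lower()] = raw.strip('"')
    return params


# Header value and expected address copied from the failure message.
header = 'for=100.64.4.1;host=router-headers.example.com;proto=http;proto-version=""'
expected_client_ip = "10.128.2.12"

forwarded = parse_forwarded(header)
mismatch = forwarded["for"] != expected_client_ip
print("for =", forwarded["for"], "| expected =", expected_client_ip, "| mismatch =", mismatch)
```

The `X-Forwarded-For` header in the same message carries the same 100.64.4.1 address, so both headers reflect the source address the router saw rather than the pod address the test expected.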

Comment 1 Ben Parees 2020-03-12 15:24:46 UTC
These jobs/tests need to be fixed or temporarily removed from our release-informing list

Comment 4 W. Trevor King 2020-03-17 22:38:41 UTC
(In reply to Ben Parees from comment #1)
> These jobs/tests need to be fixed or temporarily removed from our
> release-informing list

Yes please :).  Surprising to me that this doesn't happen on AWS; only on Azure and GCP:

$ curl -s 'https://search.svc.ci.openshift.org/search?name=^release-openshift-ocp-installer-.*-4.4&search=failed:+.*The+HAProxy+router+should+set+Forwarded+headers+appropriately' | jq -r '. | keys[]' | sed 's|/[^/]*$||' | sort | uniq -c
     16 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.4
     15 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.4

Comment 5 Dan Winship 2020-03-17 22:45:57 UTC
> Surprising to me that this doesn't happen on AWS; only on Azure and GCP

K8s supports two different ways of doing cloud loadbalancers; AWS does it one way and Azure/GCP do it the other way.

(The Azure/GCP way allows "direct return" of traffic:

    client -> load balancer -> node -> pod -> node -> client

whereas the AWS way does not:

    client -> load balancer -> node -> pod -> node -> load balancer -> client

)

Comment 7 W. Trevor King 2020-03-17 23:47:39 UTC
OVN also failing for 4.5 via bug 1814460. Per comment 1, can we drop the OVN release informers until these get sorted out?

Comment 8 Dan Winship 2020-03-18 01:55:31 UTC
> Per comment 1, can we drop the OVN release informers until these get sorted out?

sure. I have no idea how to do that

Comment 9 W. Trevor King 2020-03-18 19:54:26 UTC
I've filed [1] with a stab at removing the OVN release informers from 4.4+.

[1]: https://github.com/openshift/release/pull/7768

Comment 10 Miciah Dashiel Butler Masters 2020-03-31 16:03:57 UTC
This bug looks like a duplicate of bug 1802311.

In https://github.com/openshift/origin/pull/24764, I modified the failing test by adding a skip when the network plugin is OVNKubernetes; is that equivalent to removing OVN release informers?  In bug 1802311, comment 9, Clayton said, "This bug may not be deferred from 4.5 without a root cause", which I assume would apply to dropping the OVN release informers as well.

Comment 14 Dan Mace 2020-04-09 18:50:35 UTC
Quick update. Source IP is definitely getting lost on Azure when using OVN. Same OCP version on Azure with SDN works fine.

With OVN, the backend of a route (i.e. a server) is seeing various source IPs (via the `Forwarded` header) coming from the 100.64.0.0/10 (RFC 6598) CIDR[1] instead of the actual client source IP. This is the same regardless of whether the client is from the internet, or inside the cluster VPC (e.g. from a master node). Given the Azure Load Balancer, virtual networks, security groups, etc. are identical between these clusters, there's almost certainly something wrong with OVN in this context. Need to dig deeper.

[1] https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-udr-overview
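
As a quick way to classify addresses seen in the `Forwarded` header, the following sketch checks membership in the 100.64.0.0/10 range mentioned above using Python's standard `ipaddress` module. The function name `in_shared_address_space` is made up for illustration; the two sample addresses are the observed and expected values from the failure message in the description.

```python
import ipaddress

# 100.64.0.0/10 is the RFC 6598 range named in the comment above.
SHARED_ADDRESS_SPACE = ipaddress.ip_network("100.64.0.0/10")


def in_shared_address_space(address):
    """Return True if the address falls inside 100.64.0.0/10."""
    return ipaddress.ip_address(address) in SHARED_ADDRESS_SPACE


observed = "100.64.4.1"    # value of for= in the failing run
expected = "10.128.2.12"   # address the test expected

for label, address in (("observed", observed), ("expected", expected)):
    print(label, address, "in 100.64.0.0/10:", in_shared_address_space(address))
```

A check like this only tells you which range an address belongs to; it says nothing about which component rewrote the source address.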

Comment 15 Dan Mace 2020-04-09 19:27:18 UTC
And just a shallow bit of diagnostic info for anybody picking this up... iptables and other info from the OVN and SDN clusters:

### OVN implementation

router nodePort: 31075
router loadBalancer: 52.143.243.23

*nat

-A OVN-KUBE-NODEPORT -d 52.143.243.23/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.33.2:31075
-A OVN-KUBE-NODEPORT -p tcp -m tcp --dport 31075 -j DNAT --to-destination 169.254.33.2:31075

*filter

-A OVN-KUBE-NODEPORT -p tcp -m tcp --dport 31075 -j ACCEPT

7: br-nexthop: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 00:00:a9:fe:21:01 brd ff:ff:ff:ff:ff:ff
    inet 169.254.33.1/24 scope global br-nexthop
       valid_lft forever preferred_lft forever
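
For anyone comparing dumps like the one above, here is a rough Python sketch that pulls the match and DNAT target out of the two OVN-KUBE-NODEPORT nat rules and checks whether the target sits on the br-nexthop subnet. `parse_dnat_rule` is a hypothetical helper that handles only the option forms seen in these two lines; it is not a general iptables-save parser.

```python
import ipaddress
import shlex

# The two nat rules and the br-nexthop address, copied from the dump above.
NAT_RULES = [
    "-A OVN-KUBE-NODEPORT -d 52.143.243.23/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.33.2:31075",
    "-A OVN-KUBE-NODEPORT -p tcp -m tcp --dport 31075 -j DNAT --to-destination 169.254.33.2:31075",
]
BR_NEXTHOP = ipaddress.ip_interface("169.254.33.1/24")


def parse_dnat_rule(line):
    """Extract chain, destination match, port and DNAT target from one rule."""
    tokens = shlex.split(line)
    if "DNAT" not in tokens:
        return None

    def option(flag):
        # Return the token following the flag, or None if the flag is absent.
        return tokens[tokens.index(flag) + 1] if flag in tokens else None

    host, _, port = option("--to-destination").rpartition(":")
    return {
        "chain": option("-A"),
        "match_dst": option("-d"),
        "dport": option("--dport"),
        "to_host": host,
        "to_port": port,
    }


rules = [parse_dnat_rule(line) for line in NAT_RULES]
on_bridge = [ipaddress.ip_address(r["to_host"]) in BR_NEXTHOP.network for r in rules]

for rule, hit in zip(rules, on_bridge):
    print(rule["match_dst"], rule["dport"], "->", rule["to_host"], rule["to_port"], "| on br-nexthop subnet:", hit)
```

Both rules rewrite the destination to 169.254.33.2:31075, an address in the same /24 as br-nexthop. The sketch only reads that off the dump; it does not show where the client source address is replaced.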


### SDN implementation

nodePort 31177
loadBalancer 13.86.4.208

*nat

-A KUBE-SERVICES -d 13.86.4.208/32 -p tcp -m comment --comment "openshift-ingress/router-default:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-HEVFQXAKPPGAL4BV

-A KUBE-NODEPORTS -s 127.0.0.0/8 -p tcp -m comment --comment "openshift-ingress/router-default:http" -m tcp --dport 31177 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "openshift-ingress/router-default:http" -m tcp --dport 31177 -j KUBE-XLB-HEVFQXAKPPGAL4BV

-A KUBE-FW-HEVFQXAKPPGAL4BV -m comment --comment "openshift-ingress/router-default:http loadbalancer IP" -j KUBE-XLB-HEVFQXAKPPGAL4BV
-A KUBE-FW-HEVFQXAKPPGAL4BV -m comment --comment "openshift-ingress/router-default:http loadbalancer IP" -j KUBE-MARK-DROP

-A KUBE-XLB-HEVFQXAKPPGAL4BV -s 10.128.0.0/14 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-HEVFQXAKPPGAL4BV
-A KUBE-XLB-HEVFQXAKPPGAL4BV -m comment --comment "masquerade LOCAL traffic for openshift-ingress/router-default:http LB IP" -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ
-A KUBE-XLB-HEVFQXAKPPGAL4BV -m comment --comment "route LOCAL traffic for openshift-ingress/router-default:http LB IP to service chain" -m addrtype --src-type LOCAL -j KUBE-SVC-HEVFQXAKPPGAL4BV
-A KUBE-XLB-HEVFQXAKPPGAL4BV -m comment --comment "Balancing rule 0 for openshift-ingress/router-default:http" -j KUBE-SEP-77P4SFRX7UPHHEOO

Comment 16 Dan Williams 2020-05-14 03:12:41 UTC
(In reply to Dan Mace from comment #14)
> Quick update. Source IP is definitely getting lost on Azure when using OVN.
> Same OCP version on Azure with SDN works fine.
> 
> With OVN, the backend of a route (i.e. a server) is seeing various source
> IPs (via the `Forwarded` header) coming from the 100.64.0.0/10 (RFC 6598)
> CIDR[1] instead of the actual client source IP. This is the same regardless
> of whether the client is from the internet, or inside the cluster VPC (e.g.
> from a master node). Given the Azure Load Balancer, virtual networks,
> security groups, etc. are identical between these clusters, there's almost
> certainly something wrong with OVN in this context. Need to dig deeper.
> 
> [1] https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-udr-overview

Can you give a quick overview of how packets are supposed to look? I can't bridge the gap between what I know about how ovn-kubernetes handles packets, to how ingress is supposed to work.

What sets the Forwarded header and where does it get that IP from?

When the packet hits the node what are the srcip/dstip/forwarded IPs supposed to be?

When the packet hits a host-network router pod and comes out the other side, what are the srcip/dstip/forwarded IPs supposed to be?

What if the router is nodeport?

ovn-kubernetes doesn't do anything above L3 so that's my understanding gap about the Forwarded header. Whatever sets that header would be receiving traffic from a 100.64/10 address, right?

Comment 17 Dan Winship 2020-05-14 21:36:39 UTC
So there was some context from Slack that didn't make it into this bug.

The goal of the test (as I understand it) is to ensure that the endpoint of a service pointed to by a Router can determine the IP of the client that tried to connect to it.

However, it is testing this by trying to connect to the load balancer from a *pod*, and assuming that a connection from a pod to a load balancer IP is not going to be masqueraded in between the pod and the load balancer. While that happens to be true in some cases (because there are iptables rules that prevent the connection from ever actually reaching the load balancer), I don't think that Kubernetes intends to guarantee that this is true, and therefore, I think the test case is invalid.

(In fact, IIRC, we already disable this test on AWS, for all network plugins, because it doesn't work there, because we don't get the iptables rule diverting the connection on AWS.)

To be correct, the test would have to try connecting to the load balancer IP from somewhere *outside the test cluster*; e.g., we could have the test case try to connect directly from the test binary rather than spawning a pod into the test cluster and connecting from there. The problem then becomes figuring out what client IP we would expect the load balancer to see in that case, since in all likelihood the connection is going to get NATted somewhere between the test binary and the test cluster...

Comment 18 Phil Cameron 2020-05-18 16:14:46 UTC
Comment 10: https://github.com/openshift/origin/pull/24764 has merged. Is this still a problem?

Comment 19 Dan Winship 2020-05-18 16:27:14 UTC
If e2e-azure-ovn was previously release-informing and was made non-release-informing because of this, then we need to get it made release-informing again.

Other than that, I believe there is no longer any Networking team bug here.

Comment 21 Ben Bennett 2020-05-20 13:34:57 UTC

*** This bug has been marked as a duplicate of bug 1802311 ***

