We now have a kinda-almost-working Azure 4.3 IPv6 install. Except ingress isn't working.

NAME                                       VERSION                                       AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                           Unknown     Unknown       True       42m
cloud-credential                           4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      47m
cluster-autoscaler                         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
console                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    False       True          False      35m
dns                                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      44m
image-registry                                                                           False       True          False      40m
ingress                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         True       34m
insights                                   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      46m
kube-apiserver                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
kube-controller-manager                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
kube-scheduler                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
machine-api                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
machine-config                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
marketplace                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
monitoring                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      33m
network                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
node-tuning                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      42m
openshift-apiserver                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
openshift-controller-manager               4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
openshift-samples                          4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      41m
service-ca                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      46m
service-catalog-apiserver                  4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      42m
service-catalog-controller-manager         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      42m
storage                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m

Nothing super obvious in either the ingress-operator or router-default logs. The ingress operator has lots of:

2020-02-21T22:29:32.608Z INFO operator.ingress_controller ingress/controller.go:136 reconciling {"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.694Z INFO operator.ingress_controller ingress/deployment.go:742 updated router deployment {"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.757Z ERROR operator.ingress_controller ingress/controller.go:203 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded"}
2020-02-21T22:29:32.757Z INFO operator.ingress_controller ingress/controller.go:136 reconciling {"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.818Z INFO operator.ingress_controller ingress/deployment.go:742 updated router deployment {"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.878Z ERROR operator.ingress_controller ingress/controller.go:203 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded"}

while router-default just keeps saying:

I0221 21:45:42.842184       1 router.go:548] template "level"=0 "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
ok, the ingress clusteroperator says:

  - type: Degraded
    status: "True"
    reason: IngressControllersDegraded
    message: 'Some ingresscontrollers are degraded: default'

the default ingresscontroller says:

  - type: DNSReady
    status: "False"
    reason: FailedZones
    message: 'The record failed to provision in some zones: [{/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com map[]} {/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/sdn.azure.devcluster.openshift.com map[]}]'

and the default-wildcard dnsrecord says:

  - type: Failed
    status: "True"
    reason: ProviderError
    message: 'The DNS provider failed to ensure the record: failed to update dns a record: *.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com/A/*.apps?api-version=2018-09-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = ''Post https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0: dial tcp 20.190.134.9:443: connect: network is unreachable'''

And now I see that, due to a bad rebase, ingress-operator had gotten dropped from the list of pods that need hacked-up external IPv4 access.
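(For anyone retracing this later: the conditions quoted above came from walking the chain with stock oc commands, roughly the following; the resource names are simply the ones this cluster uses.)

  oc get clusteroperator ingress -o yaml
  oc -n openshift-ingress-operator get ingresscontroller default -o yaml
  oc -n openshift-ingress-operator get dnsrecord default-wildcard -o yaml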
OK, reopening... even with working DNS, ingress does not work.

The installer seems to create a ${CLUSTER_NAME}-public-lb load balancer that is used for the apiserver, and that one is dual-stack:

danw@p50:installer (release-4.3 $)> host api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com is an alias for dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com.
dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has address 52.154.163.125
dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has IPv6 address 2603:1030:b:3::48

but we get a ${CLUSTER_NAME} load balancer that is used for router-default that is single-stack IPv4:

danw@p50:installer (release-4.3 $)> host oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com has address 13.86.5.132

Connections to this LB do not succeed, causing, eg:

danw@p50:installer (release-4.3 $)> oc get clusteroperator authentication -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2020-02-23T23:57:05Z"
    message: 'RouteHealthDegraded: failed to GET route: dial tcp: i/o timeout'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded

I'm not sure whether that load balancer is initially created by kube-controller-manager or by the installer, but kube-controller-manager at least eventually takes over maintenance of it.

The Azure kube CloudProvider does not appear to support single-stack IPv6; all of the code for handling IPv6 is inside checks for dual-stack being enabled, and in several places there are explicit comments about not supporting single-stack IPv6. (eg, https://github.com/openshift/origin/blob/20075b26/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L564. There are no relevant differences between kube 1.16 and kube master.)

I played around with just unconditionally enabling all the dual-stack code... I'm not sure it actually works even for dual-stack as-is though, because it doesn't seem to take into account the fact that you can't create a single-stack IPv6 load balancer; you have to create a dual-stack load balancer even if you only want IPv6 backends.

https://github.com/openshift-kni/origin/commit/56931373 is what I came up with, which does not yet work. Although the Azure console now shows that the ${CLUSTER_NAME} load balancer has both IPv4 and IPv6 frontend IPs, kube-controller-manager repeatedly complains:

I0223 23:56:28.357260       1 azure_backoff.go:287] LoadBalancerClient.CreateOrUpdate(dwinship-ipv6-hacked-8bnw2): end
E0223 23:56:28.357296       1 azure_backoff.go:749] processHTTPRetryResponse: backoff failure, will retry, err=Code="AtleastOneIpV4RequiredForIpV6LbFrontendIpConfiguration" Message="At least one IPv4 frontend ipConfiguration is required for an IPv6 frontend ipConfiguration on the load balancer '/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-hacked-8bnw2-rg/providers/Microsoft.Network/loadBalancers/dwinship-ipv6-hacked-8bnw2'" Details=[]

which in turn leads to:

danw@p50:installer (release-4.3 $)> oc get services -n openshift-ingress router-default
NAME             TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
router-default   LoadBalancer   fd02::a8eb   <pending>     80:30564/TCP,443:31817/TCP   35m

Google isn't turning up anything useful about the error message.
You can't create an IPv6-only LB. I suspect your change is missing the second frontend configuration when you create it (you need ip_address_version set to IPv4 for one frontend and IPv6 for the other).
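To spell out what that means on the Azure side (just an az CLI illustration with placeholder resource-group/LB/IP names, not the actual cloud-provider change): an IPv6 frontend ipConfiguration is only accepted if an IPv4 one exists alongside it, and each frontend references a public IP of the matching address family. Something like:

  # placeholder names: RG = resource group, LB = existing load balancer
  az network public-ip create -g RG -n pip-v4 --sku Standard --allocation-method Static --version IPv4
  az network public-ip create -g RG -n pip-v6 --sku Standard --allocation-method Static --version IPv6
  # the IPv4 frontend has to be present (or be created in the same update) for the IPv6 one to be accepted
  az network lb frontend-ip create -g RG --lb-name LB -n frontend-v4 --public-ip-address pip-v4
  az network lb frontend-ip create -g RG --lb-name LB -n frontend-v6 --public-ip-address pip-v6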
I think the cloud provider code needs some large refactoring. The frontend config setup code makes foundational assumptions about there being a single frontend config, and the IPv6 config is bolted on in such a way that no frontend->backend rules are generated for the IPv4 config. Getting those generated will take some work, but I think it's fixable.
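For concreteness (again just an az CLI sketch with placeholder names, not the eventual cloud-provider code): once both frontends exist, the load-balancing rules presumably need to be duplicated per address family as well, since each rule binds exactly one frontend ipConfiguration to one backend pool, and with dual-stack the backend pools are per-family too. Roughly:

  # assumes frontend-v4/frontend-v6 and per-family backend pools pool-v4/pool-v6 already exist (placeholder names)
  az network lb rule create -g RG --lb-name LB -n https-v4 --protocol Tcp \
      --frontend-port 443 --backend-port 443 \
      --frontend-ip-name frontend-v4 --backend-pool-name pool-v4
  az network lb rule create -g RG --lb-name LB -n https-v6 --protocol Tcp \
      --frontend-port 443 --backend-port 443 \
      --frontend-ip-name frontend-v6 --backend-pool-name pool-v6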
Moving to 4.5 because we won't block the release on this, but I may move it back to 4.4[.z] if the fix comes more quickly than expected. The work is becoming much more involved than anticipated.
Moving out to 4.6
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.