Bug 1806067

Summary: Ingress not supported on Azure IPv6
Product: OpenShift Container Platform
Reporter: Dan Winship <danw>
Component: Networking
Assignee: Andrew McDermott <amcdermo>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED WONTFIX
Docs Contact:
Severity: medium
Priority: low
CC: amcdermo, aos-bugs, bbennett, ccoleman, erich, xtian
Version: 4.3.z
Keywords: Reopened, TestBlocker
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-04 15:20:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Dan Winship 2020-02-21 22:31:05 UTC
We now have a kinda-almost-working Azure 4.3 IPv6 install. Except ingress isn't working.

NAME                                       VERSION                                      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                          Unknown     Unknown       True       42m
cloud-credential                           4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      47m
cluster-autoscaler                         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
console                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   False       True          False      35m
dns                                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      44m
image-registry                                                                          False       True          False      40m
ingress                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         True       34m
insights                                   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      46m
kube-apiserver                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
kube-controller-manager                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
kube-scheduler                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
machine-api                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
machine-config                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
marketplace                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
monitoring                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      33m
network                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
node-tuning                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      42m
openshift-apiserver                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
openshift-controller-manager               4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
openshift-samples                          4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      41m
service-ca                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      46m
service-catalog-apiserver                  4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      42m
service-catalog-controller-manager         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      42m
storage                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m

Nothing super obvious in either the ingress-operator or router-default logs:

ingress operator has lots of:

2020-02-21T22:29:32.608Z	INFO	operator.ingress_controller	ingress/controller.go:136	reconciling	{"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.694Z	INFO	operator.ingress_controller	ingress/deployment.go:742	updated router deployment	{"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.757Z	ERROR	operator.ingress_controller	ingress/controller.go:203	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded"}
2020-02-21T22:29:32.757Z	INFO	operator.ingress_controller	ingress/controller.go:136	reconciling	{"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.818Z	INFO	operator.ingress_controller	ingress/deployment.go:742	updated router deployment	{"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.878Z	ERROR	operator.ingress_controller	ingress/controller.go:203	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded"}

while router-default just keeps saying:

I0221 21:45:42.842184       1 router.go:548] template "level"=0 "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"

Comment 3 Dan Winship 2020-02-22 14:51:29 UTC
ok, ingress clusteroperator says:

  - type: Degraded
    status: "True"
    reason: IngressControllersDegraded
    message: 'Some ingresscontrollers are degraded: default'

default ingresscontroller says:

  - type: DNSReady
    status: "False"
    reason: FailedZones
    message: 'The record failed to provision in some zones: [{/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
      map[]} {/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/sdn.azure.devcluster.openshift.com
      map[]}]'

default-wildcard dnsrecord says:

    - type: Failed
      status: "True"
      reason: ProviderError
      message: 'The DNS provider failed to ensure the record: failed to update dns
        a record: *.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com: azure.BearerAuthorizer#WithAuthorization:
        Failed to refresh the Token for request to https://management.azure.com/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com/A/*.apps?api-version=2018-09-01:
        StatusCode=0 -- Original Error: adal: Failed to execute the refresh request.
        Error = ''Post https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0:
        dial tcp 20.190.134.9:443: connect: network is unreachable'''

and now I see that due to a bad rebase, ingress-operator had gotten dropped from the list of pods that need hacked-up external IPv4 access.

Comment 4 Dan Winship 2020-02-24 00:28:14 UTC
OK, reopening... even with working DNS, ingress does not work.

The installer seems to create a ${CLUSTER_NAME}-public-lb load balancer, used for the apiserver, which is dual-stack:

    danw@p50:installer (release-4.3 $)> host api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
    api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com is an alias for dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com.
    dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has address 52.154.163.125
    dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has IPv6 address 2603:1030:b:3::48

but the ${CLUSTER_NAME} load balancer used for router-default is single-stack IPv4:

    danw@p50:installer (release-4.3 $)> host oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
    oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com has address 13.86.5.132

Connections to this LB do not succeed, causing, e.g.:

    danw@p50:installer (release-4.3 $)> oc get clusteroperator authentication -o yaml
    ...
    status:
      conditions:
      - lastTransitionTime: "2020-02-23T23:57:05Z"
        message: 'RouteHealthDegraded: failed to GET route: dial tcp: i/o timeout'
        reason: RouteHealthDegradedFailedGet
        status: "True"
        type: Degraded

I'm not sure if that load balancer is initially created by kube-controller-manager or the installer, but kube-controller-manager at least eventually takes over maintenance of it.

The Azure kube CloudProvider does not appear to support single-stack IPv6; all of the code for handling IPv6 is inside checks for dual-stack being enabled, and in several places there are explicit comments about not supporting single-stack IPv6 (e.g., https://github.com/openshift/origin/blob/20075b26/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L564; there are no relevant differences between kube 1.16 and kube master).

I played around with just unconditionally enabling all the dual-stack code... I'm not sure it actually works even for dual-stack as-is, though, because it doesn't seem to account for the fact that you can't create a single-stack IPv6 load balancer; you have to create a dual-stack load balancer even if you only want IPv6 backends. https://github.com/openshift-kni/origin/commit/56931373 is what I came up with, and it does not yet work. Although the Azure console now shows that the ${CLUSTER_NAME} load balancer has both IPv4 and IPv6 frontend IPs, kube-controller-manager repeatedly complains:

    I0223 23:56:28.357260       1 azure_backoff.go:287] LoadBalancerClient.CreateOrUpdate(dwinship-ipv6-hacked-8bnw2): end
    E0223 23:56:28.357296       1 azure_backoff.go:749] processHTTPRetryResponse: backoff failure, will retry, err=Code="AtleastOneIpV4RequiredForIpV6LbFrontendIpConfiguration" Message="At least one IPv4 frontend ipConfiguration is required for an IPv6 frontend ipConfiguration on the load balancer '/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-hacked-8bnw2-rg/providers/Microsoft.Network/loadBalancers/dwinship-ipv6-hacked-8bnw2'" Details=[]

which in turn leads to

    danw@p50:installer (release-4.3 $)> oc get services -n openshift-ingress router-default
    NAME             TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
    router-default   LoadBalancer   fd02::a8eb   <pending>     80:30564/TCP,443:31817/TCP   35m

Google isn't turning up anything useful about the error message.

Comment 5 Clayton Coleman 2020-02-26 21:22:09 UTC
You can't create a partial, IPv6-only LB. I suspect your change is missing the second configuration when you create it (you need ip_address_version set to IPv4 for one and IPv6 for the other).

Comment 6 Dan Mace 2020-03-02 23:37:07 UTC
I think the cloud provider code needs some large refactoring. The frontend config setup code makes foundational assumptions about a single frontend config, and the IPv6 config is bolted on in a way that leaves no frontend->backend rules generated for the IPv4 config. Getting them generated will take some work, but I think it's fixable.

Comment 7 Dan Mace 2020-03-03 19:52:37 UTC
Moving to 4.5 because we won't block the release, but I may change it back to a 4.4[.z] if the fix comes more quickly than expected. The work is becoming much more involved than anticipated.

Comment 11 Andrew McDermott 2020-05-19 14:49:51 UTC
Moving out to 4.6

Comment 12 Andrew McDermott 2020-06-17 09:55:18 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 13 Andrew McDermott 2020-07-09 12:05:05 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 14 Andrew McDermott 2020-07-30 09:59:58 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 15 mfisher 2020-08-18 19:55:50 UTC
Target reset to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.