We now have a kinda-almost-working Azure 4.3 IPv6 install. Except ingress isn't working.

NAME                                       VERSION                                       AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                           Unknown     Unknown       True       42m
cloud-credential                           4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      47m
cluster-autoscaler                         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
console                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    False       True          False      35m
dns                                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      44m
image-registry                                                                           False       True          False      40m
ingress                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         True       34m
insights                                   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      46m
kube-apiserver                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
kube-controller-manager                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
kube-scheduler                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
machine-api                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
machine-config                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
marketplace                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
monitoring                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      33m
network                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
node-tuning                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      42m
openshift-apiserver                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
openshift-controller-manager               4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      43m
openshift-samples                          4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      45m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      41m
service-ca                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      46m
service-catalog-apiserver                  4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      42m
service-catalog-controller-manager         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      42m
storage                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1    True        False         False      40m

Nothing super obvious in either the ingress-operator or router-default logs. The ingress operator has lots of:

2020-02-21T22:29:32.608Z INFO operator.ingress_controller ingress/controller.go:136 reconciling {"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.694Z INFO operator.ingress_controller ingress/deployment.go:742 updated router deployment {"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.757Z ERROR operator.ingress_controller ingress/controller.go:203 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded"}
2020-02-21T22:29:32.757Z INFO operator.ingress_controller ingress/controller.go:136 reconciling {"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.818Z INFO operator.ingress_controller ingress/deployment.go:742 updated router deployment {"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.878Z ERROR operator.ingress_controller ingress/controller.go:203 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded"}

while router-default just keeps saying:

I0221 21:45:42.842184       1 router.go:548] template "level"=0 "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
ok, the ingress clusteroperator says:

  - type: Degraded
    status: "True"
    reason: IngressControllersDegraded
    message: 'Some ingresscontrollers are degraded: default'

the default ingresscontroller says:

  - type: DNSReady
    status: "False"
    reason: FailedZones
    message: 'The record failed to provision in some zones: [{/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com map[]} {/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/sdn.azure.devcluster.openshift.com map[]}]'

and the default-wildcard dnsrecord says:

  - type: Failed
    status: "True"
    reason: ProviderError
    message: 'The DNS provider failed to ensure the record: failed to update dns a record: *.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com/A/*.apps?api-version=2018-09-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = ''Post https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0: dial tcp 20.190.134.9:443: connect: network is unreachable'''

And now I see that, due to a bad rebase, ingress-operator had gotten dropped from the list of pods that need hacked-up external IPv4 access.
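(For anyone retracing this later: the conditions quoted above came from walking the chain with stock oc commands, roughly the following; the resource names are simply the ones this cluster uses.)

  oc get clusteroperator ingress -o yaml
  oc -n openshift-ingress-operator get ingresscontroller default -o yaml
  oc -n openshift-ingress-operator get dnsrecord default-wildcard -o yaml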
OK, reopening... even with working DNS, ingress does not work.

The installer seems to create a ${CLUSTER_NAME}-public-lb load balancer that is used for the apiserver, and that one is dual-stack:

danw@p50:installer (release-4.3 $)> host api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com is an alias for dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com.
dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has address 52.154.163.125
dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has IPv6 address 2603:1030:b:3::48

but we get a ${CLUSTER_NAME} load balancer that is used for router-default that is single-stack IPv4:

danw@p50:installer (release-4.3 $)> host oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com has address 13.86.5.132

Connections to this LB do not succeed, causing, eg:

danw@p50:installer (release-4.3 $)> oc get clusteroperator authentication -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2020-02-23T23:57:05Z"
    message: 'RouteHealthDegraded: failed to GET route: dial tcp: i/o timeout'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded

I'm not sure whether that load balancer is initially created by kube-controller-manager or by the installer, but kube-controller-manager at least eventually takes over maintenance of it.

The Azure kube CloudProvider does not appear to support single-stack IPv6; all of the code for handling IPv6 is inside checks for dual-stack being enabled, and in several places there are explicit comments about not supporting single-stack IPv6. (eg, https://github.com/openshift/origin/blob/20075b26/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L564. There are no relevant differences between kube 1.16 and kube master.)

I played around with just unconditionally enabling all the dual-stack code... I'm not sure it actually works even for dual-stack as-is though, because it doesn't seem to take into account the fact that you can't create a single-stack IPv6 load balancer; you have to create a dual-stack load balancer even if you only want IPv6 backends.

https://github.com/openshift-kni/origin/commit/56931373 is what I came up with, which does not yet work. Although the Azure console now shows that the ${CLUSTER_NAME} load balancer has both IPv4 and IPv6 frontend IPs, kube-controller-manager repeatedly complains:

I0223 23:56:28.357260       1 azure_backoff.go:287] LoadBalancerClient.CreateOrUpdate(dwinship-ipv6-hacked-8bnw2): end
E0223 23:56:28.357296       1 azure_backoff.go:749] processHTTPRetryResponse: backoff failure, will retry, err=Code="AtleastOneIpV4RequiredForIpV6LbFrontendIpConfiguration" Message="At least one IPv4 frontend ipConfiguration is required for an IPv6 frontend ipConfiguration on the load balancer '/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-hacked-8bnw2-rg/providers/Microsoft.Network/loadBalancers/dwinship-ipv6-hacked-8bnw2'" Details=[]

which in turn leads to:

danw@p50:installer (release-4.3 $)> oc get services -n openshift-ingress router-default
NAME             TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
router-default   LoadBalancer   fd02::a8eb   <pending>     80:30564/TCP,443:31817/TCP   35m

Google isn't turning up anything useful about the error message.
You can't create an IPv6-only LB. I suspect your change is missing the second frontend configuration when you create it (you need ip_address_version set to IPv4 for one frontend and IPv6 for the other).
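To spell out what that means on the Azure side (just an az CLI illustration with placeholder resource-group/LB/IP names, not the actual cloud-provider change): an IPv6 frontend ipConfiguration is only accepted if an IPv4 one exists alongside it, and each frontend references a public IP of the matching address family. Something like:

  # placeholder names: RG = resource group, LB = existing load balancer
  az network public-ip create -g RG -n pip-v4 --sku Standard --allocation-method Static --version IPv4
  az network public-ip create -g RG -n pip-v6 --sku Standard --allocation-method Static --version IPv6
  # the IPv4 frontend has to be present (or be created in the same update) for the IPv6 one to be accepted
  az network lb frontend-ip create -g RG --lb-name LB -n frontend-v4 --public-ip-address pip-v4
  az network lb frontend-ip create -g RG --lb-name LB -n frontend-v6 --public-ip-address pip-v6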
I think the cloud provider code needs some large refactoring. The frontend config setup code makes foundational assumptions about there being a single frontend config, and the IPv6 config is bolted on in such a way that no frontend->backend rules are generated for the IPv4 config. Getting those generated will take some work, but I think it's fixable.
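For concreteness (again just an az CLI sketch with placeholder names, not the eventual cloud-provider code): once both frontends exist, the load-balancing rules presumably need to be duplicated per address family as well, since each rule binds exactly one frontend ipConfiguration to one backend pool, and with dual-stack the backend pools are per-family too. Roughly:

  # assumes frontend-v4/frontend-v6 and per-family backend pools pool-v4/pool-v6 already exist (placeholder names)
  az network lb rule create -g RG --lb-name LB -n https-v4 --protocol Tcp \
      --frontend-port 443 --backend-port 443 \
      --frontend-ip-name frontend-v4 --backend-pool-name pool-v4
  az network lb rule create -g RG --lb-name LB -n https-v6 --protocol Tcp \
      --frontend-port 443 --backend-port 443 \
      --frontend-ip-name frontend-v6 --backend-pool-name pool-v6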
Moving to 4.5 because we won't block the release on this, but I may move it back to 4.4[.z] if the fix comes more quickly than expected. The work is becoming much more involved than anticipated.
Moving out to 4.6
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.