Bug 1806067
| Summary: | Ingress not supported on Azure IPv6 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Winship <danw> |
| Component: | Networking | Assignee: | Andrew McDermott <amcdermo> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | medium | | |
| Priority: | low | CC: | amcdermo, aos-bugs, bbennett, ccoleman, erich, xtian |
| Version: | 4.3.z | Keywords: | Reopened, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-04 15:20:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
ok, ingress clusteroperator says:
- type: Degraded
  status: "True"
  reason: IngressControllersDegraded
  message: 'Some ingresscontrollers are degraded: default'
default ingresscontroller says:
- type: DNSReady
  status: "False"
  reason: FailedZones
  message: 'The record failed to provision in some zones: [{/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
    map[]} {/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/sdn.azure.devcluster.openshift.com
    map[]}]'
default-wildcard dnsrecord says:
- type: Failed
  status: "True"
  reason: ProviderError
  message: 'The DNS provider failed to ensure the record: failed to update dns
    a record: *.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com: azure.BearerAuthorizer#WithAuthorization:
    Failed to refresh the Token for request to https://management.azure.com/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-43-tvkrk-rg/providers/Microsoft.Network/privateDnsZones/dwinship-ipv6-43.sdn.azure.devcluster.openshift.com/A/*.apps?api-version=2018-09-01:
    StatusCode=0 -- Original Error: adal: Failed to execute the refresh request.
    Error = ''Post https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0:
    dial tcp 20.190.134.9:443: connect: network is unreachable'''
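For reference, the conditions above can be re-checked with (assuming the default resource names):

oc get clusteroperator ingress -o yaml
oc -n openshift-ingress-operator get ingresscontroller default -o yaml
oc -n openshift-ingress-operator get dnsrecord default-wildcard -o yaml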
and now I see that due to a bad rebase, ingress-operator had gotten dropped from the list of pods that need hacked-up external IPv4 access.
OK, reopening... even with working DNS, ingress does not work.
The installer seems to create a ${CLUSTER_NAME}-public-lb load balancer, used for the apiserver, which is dual-stack:
danw@p50:installer (release-4.3 $)> host api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
api.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com is an alias for dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com.
dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has address 52.154.163.125
dwinship-ipv6-43-nzxlv.centralus.cloudapp.azure.com has IPv6 address 2603:1030:b:3::48
but we get a ${CLUSTER_NAME} load balancer, used for router-default, that is single-stack IPv4:
danw@p50:installer (release-4.3 $)> host oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com
oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com has address 13.86.5.132
connections to this LB do not succeed, causing, e.g.:
danw@p50:installer (release-4.3 $)> oc get clusteroperator authentication -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2020-02-23T23:57:05Z"
    message: 'RouteHealthDegraded: failed to GET route: dial tcp: i/o timeout'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded
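The timeout can also be checked from outside the cluster, e.g. (assuming the oauth route is served through this LB):

curl -kv --connect-timeout 10 https://oauth-openshift.apps.dwinship-ipv6-43.sdn.azure.devcluster.openshift.com/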
I'm not sure if that load balancer is initially created by kube-controller-manager or the installer, but kube-controller-manager at least eventually takes over maintenance of it.
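One way to compare the two load balancers' frontends from the Azure side (resource group and LB names below are placeholders for the cluster's infra ID):

# apiserver LB (dual-stack frontend expected)
az network lb frontend-ip list -g <infra-id>-rg --lb-name <infra-id>-public-lb -o table

# service/router LB (IPv4-only frontend observed)
az network lb frontend-ip list -g <infra-id>-rg --lb-name <infra-id> -o table

# the address family is a property of the referenced public IPs
az network public-ip list -g <infra-id>-rg -o table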
The Azure kube CloudProvider does not appear to support single-stack IPv6; all of the code for handling IPv6 is inside checks for dual-stack being enabled, and in several places there are explicit comments about not supporting single-stack IPv6 (e.g., https://github.com/openshift/origin/blob/20075b26/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L564; there are no relevant differences between kube 1.16 and kube master).
I played around with just unconditionally enabling all the dual-stack code... I'm not sure it actually works even for dual-stack as-is though, because it doesn't seem to take into account the fact that you can't create a single-stack IPv6 load balancer; you have to create a dual-stack load balancer even if you only want IPv6 backends. https://github.com/openshift-kni/origin/commit/56931373 is what I came up with, which does not yet work. Although the Azure console now shows that the ${CLUSTER_NAME} load balancer has both IPv4 and IPv6 frontend IPs, kube-controller-manager repeatedly complains that:
I0223 23:56:28.357260 1 azure_backoff.go:287] LoadBalancerClient.CreateOrUpdate(dwinship-ipv6-hacked-8bnw2): end
E0223 23:56:28.357296 1 azure_backoff.go:749] processHTTPRetryResponse: backoff failure, will retry, err=Code="AtleastOneIpV4RequiredForIpV6LbFrontendIpConfiguration" Message="At least one IPv4 frontend ipConfiguration is required for an IPv6 frontend ipConfiguration on the load balancer '/subscriptions/5970b0fe-21de-4e1a-a192-0a785017e3b7/resourceGroups/dwinship-ipv6-hacked-8bnw2-rg/providers/Microsoft.Network/loadBalancers/dwinship-ipv6-hacked-8bnw2'" Details=[]
which in turn leads to
danw@p50:installer (release-4.3 $)> oc get services -n openshift-ingress router-default
NAME             TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
router-default   LoadBalancer   fd02::a8eb   <pending>     80:30564/TCP,443:31817/TCP   35m
Google isn't turning up anything useful about the error message.
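The constraint itself can be poked at directly with the az CLI, though; a rough sketch (resource names are placeholders, Standard SKU assumed, and assuming the same validation applies when building the LB by hand):

# IPv4 and IPv6 public IPs to use as frontends
az network public-ip create -g test-rg -n lb-v4 --sku Standard --version IPv4
az network public-ip create -g test-rg -n lb-v6 --sku Standard --version IPv6

# per the error above, Azure wants at least one IPv4 frontend ipConfiguration
# before (or alongside) an IPv6 one, so create the LB with the IPv4 frontend
# first and then add the IPv6 frontend
az network lb create -g test-rg -n test-lb --sku Standard --public-ip-address lb-v4
az network lb frontend-ip create -g test-rg --lb-name test-lb -n v6-frontend --public-ip-address lb-v6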
You can't create a partial IPv6-only LB. I suspect your change is missing the second configuration when you create it (you need ip_address_version set to IPv4 for one and IPv6 for the other).

I think the cloud provider code needs some large refactoring. The frontend config setup code makes foundational assumptions about a single frontend config, and the IPv6 config is bolted on such that no frontend->backend rules are generated for the IPv4 config; getting them generated will take some work, but I think it's fixable.

Moving to 4.5 because we won't block the release, but I may change it back to 4.4[.z] if the fix comes more quickly than expected.

The work is becoming much more involved than anticipated. Moving out to 4.6.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
We now have a kinda-almost-working Azure 4.3 IPv6 install. Except ingress isn't working.

NAME                                       VERSION                                      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                          Unknown     Unknown       True       42m
cloud-credential                           4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      47m
cluster-autoscaler                         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
console                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   False       True          False      35m
dns                                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      44m
image-registry                                                                          False       True          False      40m
ingress                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         True       34m
insights                                   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      46m
kube-apiserver                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
kube-controller-manager                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
kube-scheduler                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
machine-api                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
machine-config                             4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
marketplace                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
monitoring                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      33m
network                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
node-tuning                                4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      42m
openshift-apiserver                        4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
openshift-controller-manager               4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      43m
openshift-samples                          4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m
operator-lifecycle-manager                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      45m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      41m
service-ca                                 4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      46m
service-catalog-apiserver                  4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      42m
service-catalog-controller-manager         4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      42m
storage                                    4.3.0-0.nightly-2020-02-21-091838-ipv6.2d1   True        False         False      40m

nothing super obvious in either the ingress-operator or router-default logs; ingress operator has lots of:

2020-02-21T22:29:32.608Z INFO operator.ingress_controller ingress/controller.go:136 reconciling {"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.694Z INFO operator.ingress_controller ingress/deployment.go:742 updated router deployment {"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.757Z ERROR operator.ingress_controller ingress/controller.go:203 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded"}
2020-02-21T22:29:32.757Z INFO operator.ingress_controller ingress/controller.go:136 reconciling {"request": "openshift-ingress-operator/default"}
2020-02-21T22:29:32.818Z INFO operator.ingress_controller ingress/deployment.go:742 updated router deployment {"namespace": "openshift-ingress", "name": "router-default"}
2020-02-21T22:29:32.878Z ERROR operator.ingress_controller ingress/controller.go:203 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded"}

while router-default just keeps saying:

I0221 21:45:42.842184       1 router.go:548] template "level"=0 "msg"="router reloaded" "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
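For reference, the equivalent log commands (assuming the default deployment names):

oc -n openshift-ingress-operator logs deploy/ingress-operator -c ingress-operator
oc -n openshift-ingress logs deploy/router-default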