Bug 1813062
| Field | Value |
|---|---|
| Summary | DNS service IP already allocated |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | router |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Version | 4.2.z |
| Target Milestone | --- |
| Target Release | 4.6.0 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | SDN-CUST-IMPACT |
| Reporter | John Coleman <jocolema> |
| Assignee | Daneyon Hansen <dhansen> |
| QA Contact | Hongan Li <hongli> |
| CC | aconstan, amcdermo, aos-bugs, bbennett, dhansen, fhirtz, maszulik, mfojtik |
| Clones | 1891979 (view as bug list) |
| Bug Blocks | 1891979 |
| Type | Bug |
| Last Closed | 2020-10-27 15:56:42 UTC |
Description
John Coleman, 2020-03-12 20:29:04 UTC
Seems unlikely this is a regression, so I'm moving it out of the 4.4 release.

network.config/cluster shows a service CIDR of 10.253.158.0/24, which means the available host addresses are 10.253.158.1-.254. The service IP assigned to the default cluster DNS service is 10.253.159.10, which is outside the service CIDR. This appears to be the result of a degraded state caused by a network.config/cluster misconfiguration. SDN IPAM should not be allocating the <SERVICE_CIDR>.10 address; that address is reserved for the DNS service IP.

I see the Calico plugin is in play, so maybe a static IP was assigned [1] to the openshift-marketplace service? Reassigning to the SDN team to provide input.

[1] https://docs.projectcalico.org/v3.10/networking/use-specific-ip

Hi,

> One thing to note is that the customer did tweak the SDN and Service CIDR's, they also installed with Calico plugin.

Tweaking the service CIDR post cluster installation is not supported. This could be the reason for the IP collision. I will have to close this bug as WONTFIX.

-Alex
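The out-of-range allocation described in the report can be checked with the standard library's ipaddress module; the addresses below are the ones from this bug:

```python
import ipaddress

# Service CIDR from network.config/cluster, as reported in this bug.
service_cidr = ipaddress.ip_network("10.253.158.0/24")

# The clusterIP that ended up on the default cluster DNS service.
dns_ip = ipaddress.ip_address("10.253.159.10")

# The reserved DNS address is the 10th host address of the service CIDR.
expected_dns_ip = service_cidr.network_address + 10

print(dns_ip in service_cidr)  # False: the assigned IP is outside the CIDR
print(expected_dns_ip)         # 10.253.158.10
```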
NodeIPAMController, although it's part of kube-controller-manager, is under the networking team, so moving accordingly.

The dns operator allocates the 10th IP from the network config serviceCIDR [1] to the DNS service cluster IP [2]. The marketplace operator should not be getting installed before the dns operator; see [3][4][5] for details. It does not appear that the marketplace operator follows the run level schema identified in [3]. However, I followed the Calico install guide [6] for OCP and had no issue installing a cluster using 4.5.0-0.nightly-2020-05-04-113741:

```
$ openshift-install create cluster
INFO Consuming OpenShift Install (Manifests) from target directory
<SNIP>
INFO Install complete!

$ oc get all -n calico-system
NAME                                           READY   STATUS    RESTARTS   AGE
pod/calico-kube-controllers-558b5bb4fc-n82wz   1/1     Running   0          22m
pod/calico-node-2m4n8                          1/1     Running   0          22m
pod/calico-node-55wld                          1/1     Running   0          11m
pod/calico-node-6ddqf                          1/1     Running   0          22m
pod/calico-node-fzxfj                          1/1     Running   0          22m
pod/calico-node-p7wlt                          1/1     Running   0          11m
pod/calico-node-zjtpg                          1/1     Running   0          11m
pod/calico-typha-759d74b7d9-cf7td              1/1     Running   0          20m
pod/calico-typha-759d74b7d9-f6rd5              1/1     Running   0          22m
pod/calico-typha-759d74b7d9-j28ff              1/1     Running   0          10m
pod/calico-typha-759d74b7d9-rh6hf              1/1     Running   0          20m

NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/calico-typha   ClusterIP   172.30.250.232   <none>        5473/TCP   22m

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/calico-node   6         6         6       6            6           <none>          22m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   1/1     1            1           22m
deployment.apps/calico-typha              4/4     4            4           22m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/calico-kube-controllers-558b5bb4fc   1         1         1       22m
replicaset.apps/calico-typha-759d74b7d9              4         4         4       22m
```

Note: I did not select the optional step in the Calico install guide. Steps for reproducing the bz are needed.
[1] https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/controller.go#L378-L399
[2] https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/controller_dns_service.go#L72-L74
[3] https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#how-do-i-get-added-as-a-special-run-level
[4] https://github.com/openshift/cluster-dns-operator/tree/master/manifests
[5] https://github.com/operator-framework/operator-marketplace/tree/master/manifests
[6] https://docs.projectcalico.org/getting-started/openshift/installation

> Could you please revise the dns-operator to handle such a case?
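The allocation scheme described above, taking the 10th address of the serviceCIDR, can be sketched as follows. This is an illustrative reimplementation for clarity, not the operator's actual Go code:

```python
import ipaddress

def dns_cluster_ip(service_cidr: str) -> str:
    """Return the 10th host address of the service CIDR, mirroring the
    cluster-dns-operator's choice of DNS service cluster IP (a sketch,
    not the actual implementation)."""
    network = ipaddress.ip_network(service_cidr)
    return str(network.network_address + 10)

# The default OCP service network yields the familiar DNS service IP.
print(dns_cluster_ip("172.30.0.0/16"))  # 172.30.0.10
```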
If we allow the DNS service IP to be arbitrary, I'm confident cluster operators (the human kind), app devs, etc. are going to be very unhappy. <service_cidr>.10 has been used for DNS for a long time by OCP and upstream k8s installs (kubeadm). Having a consistent IP for key infra endpoints (apiserver, default gateways, DNS, etc.) simplifies management and troubleshooting. I think we need to figure out a way to continue using .10. Since k8s does not provide a mechanism for reserving an IP from the service CIDR, let me see if we can create the DNS service as part of the initial install.
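Creating the DNS Service early with an explicit clusterIP is the mechanism that would claim the address before anything else can take it: the apiserver rejects a Service whose requested clusterIP is outside the service CIDR or already allocated, which is the collision this bug describes. A minimal sketch of such a manifest follows; field names track the core/v1 Service schema, while the selector label and port layout here are illustrative assumptions, not confirmed operator output:

```python
# Build a core/v1 Service manifest that pins the DNS clusterIP explicitly.
# The selector and ports below are illustrative placeholders.
dns_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "dns-default", "namespace": "openshift-dns"},
    "spec": {
        # <service_cidr>.10, assuming the default 172.30.0.0/16 service network
        "clusterIP": "172.30.0.10",
        "selector": {"app": "dns-default"},  # illustrative label
        "ports": [
            {"name": "dns", "port": 53, "protocol": "UDP"},
            {"name": "dns-tcp", "port": 53, "protocol": "TCP"},
        ],
    },
}

print(dns_service["spec"]["clusterIP"])  # 172.30.0.10
```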
Blocked waiting for feedback [1] regarding the installer's ability to create the DNS Service early during the install process to ensure DNS always gets the <service_cidr>.10 address.

[1] https://coreos.slack.com/archives/C68TNFWA2/p1589563877199300

Moving to 4.6.

I'm adding UpcomingSprint because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

@Frank, the dns operator reserves the <SERVICE_CIDR>.10 address (see https://bugzilla.redhat.com/show_bug.cgi?id=1813062#c16 for details). This address is reserved because the kubelet is configured by MCO with the same <SERVICE_CIDR>.10 [1][2][3]. Making this address configurable in the dns operator is a breaking change. Please create an RFE or open an MCO bug that references this BZ for additional background.

[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/network/dns/dns.go
[2] https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/render.go#L114
[3] https://github.com/openshift/machine-config-operator/blob/master/templates/master/01-master-kubelet/_base/files/kubelet.yaml#L15

Hi, I've got another case 02666014 that appears to be related to this issue. Are there any updates on the QA process that I can relay to the cu?
Brandon, note that the PR waiting to be QE'd surfaces dns operator status conditions for this use case and does not fix the underlying issue. As previously mentioned, the <service_cidr>.10 IP is required for the DNS Service IP. For standard installations this is not an issue because components follow CVO run levels: https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#how-do-i-get-added-as-a-special-run-level

Verified with 4.6.0-0.nightly-2020-08-27-005538 and passed. Test steps:

1. Disable the dns operator temporarily:
   ```
   $ oc -n openshift-dns-operator scale deploy/dns-operator --replicas=0
   ```
2. Delete the existing dns-default service that is taking the IP 172.30.0.10:
   ```
   $ oc -n openshift-dns delete svc dns-default
   ```
3. Create another test service taking the IP 172.30.0.10:
   ```
   $ oc create service clusterip test-dns --tcp=53:53 --clusterip="172.30.0.10"
   ```
4. Re-enable the dns operator:
   ```
   $ oc -n openshift-dns-operator scale deploy/dns-operator --replicas=1
   ```
5. Check the dns operator; it should be degraded and show the message "No IP assigned to DNS service".
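The status behavior being verified here can be modeled as follows; this is an illustrative sketch of the operator's condition logic, not its actual Go code:

```python
def dns_status_conditions(cluster_ip):
    """Model the dns operator's status conditions for the DNS service.
    When no cluster IP is assigned (because .10 is already taken), the
    operator reports Degraded/Progressing and is not Available."""
    if not cluster_ip:
        msg = "No IP assigned to DNS service"
        return [
            {"type": "Degraded", "status": "True", "reason": "NoServiceIP", "message": msg},
            {"type": "Progressing", "status": "True", "reason": "Reconciling", "message": msg},
            {"type": "Available", "status": "False", "reason": "NoServiceIP", "message": msg},
        ]
    return [
        {"type": "Degraded", "status": "False"},
        {"type": "Progressing", "status": "False"},
        {"type": "Available", "status": "True"},
    ]

# With the .10 address stolen by the test service, no IP can be assigned:
for cond in dns_status_conditions(""):
    print(cond["type"], cond["status"])
```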
```
$ oc get co/dns
NAME   VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.6.0-0.nightly-2020-08-27-005538   False       True          True       95s

$ oc get dnses.operator.openshift.io default -oyaml
apiVersion: operator.openshift.io/v1
kind: DNS
<---snip--->
spec: {}
status:
  clusterDomain: cluster.local
  clusterIP: ""
  conditions:
  - lastTransitionTime: "2020-08-28T01:54:47Z"
    message: No IP assigned to DNS service
    reason: NoServiceIP
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-08-28T01:54:47Z"
    message: No IP assigned to DNS service
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-08-28T01:54:47Z"
    message: No IP assigned to DNS service
    reason: NoServiceIP
    status: "False"
    type: Available
```

> To note, I *do* see the upstream discussion on this as having the "fix" option just be to allow the DNS operator to fall back and use a different address if .10 is taken (I presume that's behind the direction that you were going with this), but is that the only realistic option? I'd think that it'd be a difficult case to make since our concern is a bit nebulous (we don't know what/who having a non-.10 DNS address might break).

Other options exist, but they require an RFE. At this point, .10 is required for the DNS service IP.

> As noted, we can split this to a new bz/issue but the client hits this with some regularity and this bz shifted tack about mid-way from "making it work" to fixing the error handling (which is useful, but the cluster still doesn't reliably build). I'm trying to see how to best go about solving that.

We need to be provided a reproducer to consider other alternatives. It would be helpful to get Calico involved since the issue is related to running OCP with their networking plugin.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.