Description of problem:

The apiserver pods were in a CrashLoopBackOff state:

$ oc get pods -n openshift-apiserver
NAME              READY   STATUS             RESTARTS   AGE
apiserver-78vlp   0/1     CrashLoopBackOff   11         36m
apiserver-kvmq6   0/1     CrashLoopBackOff   11         36m
apiserver-zw4kx   0/1     CrashLoopBackOff   11         36m

Logs for the apiserver pods:

W0226 21:42:15.234536       1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {etcd.openshift-etcd.svc:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: operation was canceled". Reconnecting...
W0226 21:42:15.234556       1 asm_amd64.s:1337] Failed to dial etcd.openshift-etcd.svc:2379: grpc: the connection is closing; please retry.
F0226 21:42:15.234539       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io {[https://etcd.openshift-etcd.svc:2379] /var/run/secrets/etcd-client/tls.key /var/run/secrets/etcd-client/tls.crt /var/run/configmaps/etcd-serving-ca/ca-bundle.crt} false true {0xc000ebfe60 0xc000ebfef0} {{apps.openshift.io v1} [{apps.openshift.io } {apps.openshift.io }] false} <nil> 5m0s 1m0s}), err (context deadline exceeded)

$ oc4 get events -n openshift-apiserver
LAST SEEN   TYPE      REASON             OBJECT                MESSAGE
43m         Normal    Scheduled          pod/apiserver-78vlp   Successfully assigned openshift-apiserver/apiserver-78vlp to domain-name
43m         Warning   FailedScheduling   pod/apiserver-78vlp   Binding rejected: Operation cannot be fulfilled on pods/binding "apiserver-78vlp": pod apiserver-78vlp is already assigned to node "domain-name"

We created a debug pod and attempted to curl etcd, but could not reach it:

$ oc4 debug ds/apiserver -n openshift-apiserver
...
sh-4.2# curl -vk https://etcd.openshift-etcd.svc:2379
* Could not resolve host: etcd.openshift-etcd.svc; Unknown error
* Closing connection 0
curl: (6) Could not resolve host: etcd.openshift-etcd.svc; Unknown error

The issue was tracked down by inspecting the DNS operator logs:

openshift-dns-operator
----------------------
2020-02-27T21:24:19.449200823Z time="2020-02-27T21:24:19Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to create service for dns default: failed to create dns service: Service \"dns-default\" is invalid: spec.clusterIP: Invalid value: \"10.253.159.10\": provided IP is already allocated"

We then had the customer check whether the IP was already in use, and found that it was. We had the customer remove the conflicting service, and the apiserver pods went into a running state. We believe it must have been a timing issue during the original installation, or the Service was deleted at some point and another Service pending creation quickly snagged the IP.

Version-Release number of selected component (if applicable):
OpenShift 4.2.z

How reproducible:
This occurred on 2 out of 5 clusters within the customer's infrastructure.

Additional info:
One thing to note is that the customer did tweak the SDN and Service CIDRs, and they also installed with the Calico plugin.
See below:

oc3 describe network.config/cluster
-----------------------------------
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2020-02-28T02:30:18Z
  Generation:          2
  Resource Version:    2345
  Self Link:           /apis/config.openshift.io/v1/networks/cluster
  UID:                 4281aa32-59d2-11ea-98dc-005056bfc1a6
Spec:
  Cluster Network:
    Cidr:         10.253.144.0/23
    Host Prefix:  26
  External IP:
    Policy:
  Network Type:  Calico
  Service Network:
    10.253.158.0/24
Status:
  Cluster Network:
    Cidr:               10.253.144.0/23
    Host Prefix:        26
  Cluster Network MTU:  1410
  Network Type:         Calico
  Service Network:
    10.253.158.0/24
Events:  <none>
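In case it helps anyone hitting the same symptom, here is a minimal client-go sketch (a hypothetical helper written for this comment, not something taken from the cluster) that locates which Service currently holds a given cluster IP — the check the customer performed manually above:

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// The IP the DNS operator tried to allocate for dns-default.
	const conflictingIP = "10.253.159.10"

	// Build a client from the local kubeconfig (illustrative only).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List services in all namespaces and report any that already own the IP.
	svcs, err := client.CoreV1().Services(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.ClusterIP == conflictingIP {
			fmt.Printf("%s/%s holds %s\n", svc.Namespace, svc.Name, conflictingIP)
			return
		}
	}
	fmt.Println("no service holds", conflictingIP)
	os.Exit(1)
}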
Seems unlikely this is a regression, so I'm moving it out of the 4.4 release.
network.config/cluster shows a service CIDR of 10.253.158.0/24, which means the available host addresses are 10.253.158.1-.254. The service IP assigned to the default cluster DNS service is 10.253.159.10, which is outside the service CIDR scope. This appears to be the result of a degraded state caused by a network.config/cluster misconfiguration.
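As a quick illustration (not operator code), a small Go check confirming that the allocated address falls outside the configured service CIDR:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Service CIDR from network.config/cluster and the IP that was
	// allocated to the default DNS service on the affected cluster.
	_, serviceCIDR, err := net.ParseCIDR("10.253.158.0/24")
	if err != nil {
		panic(err)
	}
	dnsServiceIP := net.ParseIP("10.253.159.10")

	// Contains reports whether the IP is within the service CIDR;
	// here it prints false, confirming the address is out of scope.
	fmt.Printf("%s in %s: %v\n", dnsServiceIP, serviceCIDR, serviceCIDR.Contains(dnsServiceIP))
}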
SDN IPAM should not be allocating the <SERVICE_CIDR>.10 address. This address is reserved for the DNS service IP. I see the Calico plugin is in play, so maybe a static IP was assigned [1] to the openshift-marketplace service? Reassigning to the SDN team to provide input.

[1] https://docs.projectcalico.org/v3.10/networking/use-specific-ip
Hi,

> One thing to note is that the customer did tweak the SDN and Service CIDRs, and they also installed with the Calico plugin.

Tweaking the service CIDR is not supported post cluster installation. This could be the reason for the IP collision. I will have to close this bug as WONTFIX.

-Alex
Although NodeIPAMController is part of kube-controller-manager, it is owned by the networking team, so moving accordingly.
The dns operator allocates the 10th IP from the network config serviceCIDR [1] to the DNS service cluster IP [2]. The marketplace operator should not be getting installed before the dns operator; see [3][4][5] for details. It does not appear that the marketplace operator follows the run-level schema identified in [3].

However, I followed the Calico install guide [6] for OCP and had no issue installing a cluster using 4.5.0-0.nightly-2020-05-04-113741:

$ openshift-install create cluster
INFO Consuming OpenShift Install (Manifests) from target directory
<SNIP>
INFO Install complete!

$ oc get all -n calico-system
NAME                                           READY   STATUS    RESTARTS   AGE
pod/calico-kube-controllers-558b5bb4fc-n82wz   1/1     Running   0          22m
pod/calico-node-2m4n8                          1/1     Running   0          22m
pod/calico-node-55wld                          1/1     Running   0          11m
pod/calico-node-6ddqf                          1/1     Running   0          22m
pod/calico-node-fzxfj                          1/1     Running   0          22m
pod/calico-node-p7wlt                          1/1     Running   0          11m
pod/calico-node-zjtpg                          1/1     Running   0          11m
pod/calico-typha-759d74b7d9-cf7td              1/1     Running   0          20m
pod/calico-typha-759d74b7d9-f6rd5              1/1     Running   0          22m
pod/calico-typha-759d74b7d9-j28ff              1/1     Running   0          10m
pod/calico-typha-759d74b7d9-rh6hf              1/1     Running   0          20m

NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/calico-typha   ClusterIP   172.30.250.232   <none>        5473/TCP   22m

NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/calico-node   6         6         6       6            6           <none>          22m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   1/1     1            1           22m
deployment.apps/calico-typha              4/4     4            4           22m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/calico-kube-controllers-558b5bb4fc   1         1         1       22m
replicaset.apps/calico-typha-759d74b7d9              4         4         4       22m

Note: I did not select the optional step in the Calico install guide.

Steps for reproducing the BZ are needed.

[1] https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/controller.go#L378-L399
[2] https://github.com/openshift/cluster-dns-operator/blob/master/pkg/operator/controller/controller_dns_service.go#L72-L74
[3] https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#how-do-i-get-added-as-a-special-run-level
[4] https://github.com/openshift/cluster-dns-operator/tree/master/manifests
[5] https://github.com/operator-framework/operator-marketplace/tree/master/manifests
[6] https://docs.projectcalico.org/getting-started/openshift/installation
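For reference, a minimal Go sketch of the idea behind the DNS service IP selection — taking the tenth address of the service CIDR. This is an illustration assuming an IPv4 service network, not the operator's actual implementation (see [1][2] above for that):

package main

import (
	"fmt"
	"net"
)

// tenthIP returns the tenth host address of an IPv4 service CIDR,
// e.g. 172.30.0.0/16 -> 172.30.0.10.
func tenthIP(cidr string) (net.IP, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	base := network.IP.To4()
	if base == nil {
		return nil, fmt.Errorf("%s is not an IPv4 CIDR", cidr)
	}
	ip := make(net.IP, len(base))
	copy(ip, base)
	ip[3] += 10 // assumes the CIDR leaves room for at least ten host addresses
	if !network.Contains(ip) {
		return nil, fmt.Errorf("%s does not contain a tenth host address", cidr)
	}
	return ip, nil
}

func main() {
	ip, err := tenthIP("10.253.158.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Println(ip) // prints 10.253.158.10
}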
> Could you please revise the dns-operator to handle such a case?

If we allow the DNS service IP to be arbitrary, I'm confident cluster operators (the human kind), app devs, etc. are going to be very unhappy. <service_cidr>.10 has been used for DNS for a long time by OCP and upstream k8s installs (kubeadm). Having a consistent IP for key infra endpoints (apiserver, default gateways, DNS, etc.) simplifies management and troubleshooting. I think we need to figure out a way to continue using .10.

Since k8s does not provide a mechanism for reserving an IP from the service CIDR, let me see if we can create the DNS Service as part of the initial install.
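To sketch the idea (hypothetical code, not the installer's): a Service can request a specific cluster IP at creation time, and the apiserver rejects the request if the address is outside the service CIDR or already taken — which is exactly the error seen in this bug. The names, labels, and kubeconfig handling below are illustrative only:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Request the <service_cidr>.10 address explicitly; creating this Service
	// early in the install would effectively reserve the IP for DNS.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "dns-default", Namespace: "openshift-dns"},
		Spec: corev1.ServiceSpec{
			ClusterIP: "172.30.0.10", // <service_cidr>.10 for the default 172.30.0.0/16
			Selector:  map[string]string{"app": "dns-default"}, // illustrative selector
			Ports: []corev1.ServicePort{{
				Name:       "dns",
				Port:       53,
				Protocol:   corev1.ProtocolUDP,
				TargetPort: intstr.FromString("dns"),
			}},
		},
	}
	created, err := client.CoreV1().Services(svc.Namespace).Create(context.TODO(), svc, metav1.CreateOptions{})
	if err != nil {
		// e.g. "provided IP is already allocated" if something else grabbed .10 first.
		panic(err)
	}
	fmt.Println("reserved cluster IP:", created.Spec.ClusterIP)
}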
Blocked waiting for feedback [1] regarding the installer's ability to create the DNS Service early during the install process to ensure DNS always gets the <service_cidr>.10 address.

[1] https://coreos.slack.com/archives/C68TNFWA2/p1589563877199300
Moving to 4.6.
I’m adding UpcomingSprint because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
@Frank,

The dns operator reserves the <SERVICE_CIDR>.10 address (see https://bugzilla.redhat.com/show_bug.cgi?id=1813062#c16 and [1] for details). This address is reserved because the kubelet is configured by the MCO with the same <SERVICE_CIDR>.10 address [2][3]. Making this address configurable in the dns operator is a breaking change. Please create an RFE or open an MCO bug that references this BZ for additional background.

[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/network/dns/dns.go
[2] https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/render.go#L114
[3] https://github.com/openshift/machine-config-operator/blob/master/templates/master/01-master-kubelet/_base/files/kubelet.yaml#L15
Hi, I've got another case 02666014 that appears to be related to this issue. Are there any updates on the QA process that I can relay to the cu?
Brandon,

Note that the PR waiting to be QE'd is to surface dns operator status conditions for this use case; it does not fix the underlying issue. As previously mentioned, the <service_cidr>.10 IP is required for the DNS Service IP. For standard installations this is not an issue due to components following CVO run-levels:

https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#how-do-i-get-added-as-a-special-run-level
Verified with 4.6.0-0.nightly-2020-08-27-005538 and passed.

Test steps:

1. Disable the dns operator temporarily:
   $ oc -n openshift-dns-operator scale deploy/dns-operator --replicas=0

2. Delete the existing dns-default service that holds the IP 172.30.0.10:
   $ oc -n openshift-dns delete svc dns-default

3. Create another test service that takes the IP 172.30.0.10:
   $ oc create service clusterip test-dns --tcp=53:53 --clusterip="172.30.0.10"

4. Re-enable the dns operator:
   $ oc -n openshift-dns-operator scale deploy/dns-operator --replicas=1

5. Check the dns operator; it should be degraded and show the message "No IP assigned to DNS service":

$ oc get co/dns
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns    4.6.0-0.nightly-2020-08-27-005538   False       True          True       95s

$ oc get dnses.operator.openshift.io default -oyaml
apiVersion: operator.openshift.io/v1
kind: DNS
<---snip--->
spec: {}
status:
  clusterDomain: cluster.local
  clusterIP: ""
  conditions:
  - lastTransitionTime: "2020-08-28T01:54:47Z"
    message: No IP assigned to DNS service
    reason: NoServiceIP
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-08-28T01:54:47Z"
    message: No IP assigned to DNS service
    reason: Reconciling
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-08-28T01:54:47Z"
    message: No IP assigned to DNS service
    reason: NoServiceIP
    status: "False"
    type: Available
> To note, I *do* see the upstream discussion on this as having the "fix" option just be to allow the DNS operator to fallback and use a different address if .10 is taken (I presume that's behind the direction that you were going with this), but is that the only realistic option? I'd think that it'd be a difficult case to make since our concern is a bit nebulous (we don't know what/who having a non-.10 DNS address might break).

Other options exist, but require an RFE. At this point, .10 is required for the DNS service IP.

> As noted, we can split this to a new bz/issue but the client hits this with some regularity and this bz shifted tack about mid-way from "making it work" to fixing the error handling (which is useful, but the cluster still doesn't reliably build). I'm trying to see how to best go about solving that.

We need to be provided a reproducer to consider other alternatives. It would be helpful to get Calico involved since the issue is related to running OCP with their networking plugin.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days