Bug 1865806
| Summary: | Etcd upgrade fails with DNS clash | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mangirdas Judeikis <mjudeiki> |
| Component: | Etcd Operator | Assignee: | Dan Mace <dmace> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | amcdermo, awestbro, bbennett, dhansen, dmace, ehashman, geliu, jhunter, jmalde, jminter, mgahagan, mifiedle, mjudeiki, mmasters, sdodson, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.5.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1869681 (view as bug list) | Environment: | |
| Last Closed: | 2020-09-08 10:54:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1869681 | | |
A quick update. During the 4.4 upgrade, OpenShift migrates etcd away from relying on DNS, and in doing so tries to look up the etcd DNS records created in the cluster's private DNS zone to find the IPs that will replace the DNS addresses. However, the DNS lookup flows through CoreDNS, which in these clusters is configured to forward the requests to an external nameserver that is missing the DNS records present in the private zone.

Immediate workarounds include:

1. Configure the forwarded nameserver to delegate to the private zone containing the records.
2. Copy the records from the private zone to the forwarded zone.

Semantically this seems reasonable, because the forwarded nameserver declares authority for the domain and so should be able to answer for those records. Practically speaking, it may not be reasonable for end users to make these changes. We're currently exploring some plausible solutions that don't require manual intervention and hope to have an update as soon as possible.

I have a different concern here: what do we need to do to make sure no other component, starting from 4.4, falls into this pitfall going forward?

Customer configuration:

* The cluster uses the custom domain redhat.com (an Azure private DNS zone is created with these records and is served by Azure native DNS at the host level).
* The customer uses a forwardPlugin for redhat.com pointing to 1.1.1.1 (or any other custom DNS server).
* Any component running in the cluster and using cluster DNS will potentially be unable to reach routes, the API, etc. Such components should either use the .cluster reference or use hostNetwork/host DNS.

How would it be best to track this work item? @sdodson, @dmace, your input would be great so I can do what is needed.

(In reply to Mangirdas Judeikis from comment #5)

CC'ing the Network Edge folks. My first intuition is that Ben's earlier assessment applies here: if the administrator sets up a global forward for a domain that's declared to be managed by OpenShift (e.g. the ingress domain), that upstream nameserver had better actually be authoritative for the domain. Perhaps there's a documentation issue in this regard: the nameserver behind such a rule would need to delegate to the OpenShift-managed zones for which authority is declared, because OpenShift can't manage records in the opaque user-defined upstream.

At a glance this seems like a fairly reasonable expectation, but it's possible there's more nuance here that I'm failing to consider. At the very least, I would hope we can use the documentation to warn users about the potential footguns associated with DNS forwarding. I'm not sure whether an alert of some kind would be appropriate. Curious to get some feedback from the NE folks on this.
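For context, the query that fails during the migration is an SRV lookup for the etcd discovery record under the cluster domain. The following is a minimal, illustrative Go sketch (not the operator's actual code; the domain is the example one from this bug's description) of what that lookup looks like and why it returns "no such host" when the pod's resolver forwards the cluster domain to a nameserver that lacks the private-zone records:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Assumption: this runs inside a pod, so the default resolver is the
	// cluster DNS service (CoreDNS). With a forwardPlugin covering the
	// cluster domain, the query goes to the upstream nameserver, which is
	// missing the private-zone records, and the lookup fails.
	clusterDomain := "shared-cluster.osadev.cloud" // example domain from this bug

	// Equivalent to querying the _etcd-server-ssl._tcp.<clusterDomain> SRV record.
	_, addrs, err := net.LookupSRV("etcd-server-ssl", "tcp", clusterDomain)
	if err != nil {
		fmt.Println("SRV lookup failed:", err) // e.g. "no such host"
		return
	}
	for _, srv := range addrs {
		fmt.Printf("etcd peer: %s:%d\n", srv.Target, srv.Port)
	}
}
```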
It's a reasonable line of questioning, and it does seem clear there are implications to DNS forwarding that weren't considered when the feature was originally designed. I'm glad the feature is getting used and appreciate the feedback. This bugzilla might not be the best place to hash out the details, but it's a start; if there's a better discussion venue I'm happy to take the conversation there.

I'd expect the same problem to apply to the API; would it not? Did the customer copy the api-int records to the private zone but not the etcd records? If so, that would imply the customer knew to follow (and possibly invent) some process, and the process had a step to create the api-int records but not the etcd records, so one solution would be to make sure this process is explicitly documented and complete.

If an arbitrary component needs a DNS record in the private zone, it would make sense to have an alert for that component if the name on the record doesn't resolve. The alert could suggest checking the DNS forwarding configuration. In retrospect, it might have been best if we'd prevented overriding name resolution for the cluster domain, but it's too late to change that now.

https://bugzilla.redhat.com/show_bug.cgi?id=1867205 created to update the DNS forwarding docs.

Please note that the scope of this fix is limited to etcd errors during upgrade. If etcd upgrades, the fix works; any follow-on issues should be treated separately unless there's a reason to believe they are caused by this fix.
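To make the alerting idea concrete, here is a hypothetical sketch (not part of any shipped operator; the record name is an assumption taken from this bug's description) of the kind of resolution probe a component could run before depending on a record in the private zone:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveOrWarn is a hypothetical sketch of the alert idea discussed above:
// a component that depends on a record in the cluster's private zone checks
// that the name resolves through the resolver it will actually use, and
// surfaces a hint about DNS forwarding if it does not.
func resolveOrWarn(ctx context.Context, name string) string {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if _, err := net.DefaultResolver.LookupHost(ctx, name); err != nil {
		return fmt.Sprintf("%q does not resolve (%v); if a forwardPlugin covers this domain, "+
			"check that the upstream nameserver delegates to or contains the private zone records", name, err)
	}
	return ""
}

func main() {
	// Hypothetical record a component might depend on.
	if msg := resolveOrWarn(context.Background(), "api-int.shared-cluster.osadev.cloud"); msg != "" {
		fmt.Println("warning:", msg)
	}
}
```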
To test, I used the following procedures.
### Verify the problem
1. Launch a 4.3.31 IPI cluster on Azure.
2. Edit `dnses.operator.openshift.io/default` to include the following `spec` field:
    servers:
    - forwardPlugin:
        upstreams:
        - 1.1.1.1
      name: external-dns
      zones:
      - ci-ln-2pmxhgk-002ac.ci.azure.devcluster.openshift.com
The `zones` field should match the cluster domain.
3. Upgrade the cluster to a stable 4.4 release:
    oc adm upgrade --force --allow-upgrade-with-warnings --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release:4.4.16-x86_64
4. Verify that the upgrade fails because the etcd operator becomes degraded due to DNS lookup failures:
    $ oc get clusterversion/version
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.3.31    True        True          10m     Unable to apply 4.4.16: the cluster operator etcd is degraded
    $ oc get clusteroperators/etcd -o yaml
    apiVersion: config.openshift.io/v1
    kind: ClusterOperator
    metadata:
      creationTimestamp: "2020-08-10T14:39:34Z"
      generation: 1
      name: etcd
      resourceVersion: "25429"
      selfLink: /apis/config.openshift.io/v1/clusteroperators/etcd
      uid: 91438145-ab1a-4717-800f-57007cda0a72
    spec: {}
    status:
      conditions:
      - lastTransitionTime: "2020-08-10T14:41:34Z"
        message: 'EtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-1.ci-ln-2pmxhgk-002ac.ci.azure.devcluster.openshift.com:2380,dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-2pmxhgk-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host'
        reason: EtcdMemberIPMigrator_Error
        status: "True"
        type: Degraded
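As a programmatic equivalent of the `oc get clusteroperators/etcd -o yaml` check above, here is a small illustrative sketch (not part of the test procedure) that reads the etcd ClusterOperator's Degraded condition. It assumes a reachable kubeconfig and uses the standard openshift/api and openshift/client-go packages:

```go
package main

import (
	"context"
	"fmt"
	"log"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: ~/.kube/config points at the cluster under test.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "etcd", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, cond := range co.Status.Conditions {
		if cond.Type == configv1.OperatorDegraded {
			// During the failed upgrade this prints status True with the
			// EtcdMemberIPMigratorDegraded DNS error shown above.
			fmt.Printf("Degraded=%s reason=%s\n%s\n", cond.Status, cond.Reason, cond.Message)
		}
	}
}
```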
### Verify the fix
1. Launch a 4.3.31 IPI cluster on Azure.
2. Edit `dnses.operator.openshift.io/default` to include the following `spec` field:
    servers:
    - forwardPlugin:
        upstreams:
        - 1.1.1.1
      name: external-dns
      zones:
      - ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com
The `zones` field should match the cluster domain.
3. Upgrade the cluster to a 4.4 release image containing the fix:
    oc adm upgrade --force --allow-upgrade-with-warnings --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ci-ln-m17603t/release:latest
4. Verify that etcd successfully upgrades:
    $ oc get clusteroperators/etcd -o yaml
    apiVersion: config.openshift.io/v1
    kind: ClusterOperator
    metadata:
      creationTimestamp: "2020-08-07T15:27:54Z"
      generation: 1
      name: etcd
      resourceVersion: "28214"
      selfLink: /apis/config.openshift.io/v1/clusteroperators/etcd
      uid: c1ef4c59-4a5a-49fd-85d7-7787f1a3f058
    spec: {}
    status:
      conditions:
      - lastTransitionTime: "2020-08-07T15:32:18Z"
        message: |-
          NodeControllerDegraded: All master nodes are ready
          EtcdMembersDegraded: No unhealthy members found
        reason: AsExpected
        status: "False"
        type: Degraded
      - lastTransitionTime: "2020-08-07T15:32:42Z"
        message: |-
          NodeInstallerProgressing: 3 nodes are at revision 2
          EtcdMembersProgressing: No unstarted etcd members found
        reason: AsExpected
        status: "False"
        type: Progressing
      - lastTransitionTime: "2020-08-07T15:29:53Z"
        message: |-
          StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 2
          EtcdMembersAvailable: 3 members are available
        reason: AsExpected
        status: "True"
        type: Available
      - lastTransitionTime: "2020-08-07T15:27:54Z"
        reason: AsExpected
        status: "True"
        type: Upgradeable
5. Double-check that the expected events were produced indicating the new fallback logic was executed:
    101s   Warning   MemberIPLookupFailed   deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-0" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-0.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com:2380, dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host; will attempt a fallback lookup
    101s   Normal    MemberSettingIPPeer    deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-0"; new peer list https://10.0.0.6:2380
    101s   Normal    MemberUpdate           deployment/etcd-operator   updating member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-0" with peers https://10.0.0.6:2380
    101s   Normal    MemberMissingIPPeer    deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1" is missing an IP in the peer list
    88s    Warning   MemberIPLookupFailed   deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-1.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com:2380, dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host; will attempt a fallback lookup
    88s    Normal    MemberSettingIPPeer    deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1"; new peer list https://10.0.0.5:2380
    88s    Normal    MemberUpdate           deployment/etcd-operator   updating member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1" with peers https://10.0.0.5:2380
    88s    Normal    MemberMissingIPPeer    deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2" is missing an IP in the peer list
    76s    Warning   MemberIPLookupFailed   deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-2.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com:2380, dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host; will attempt a fallback lookup
    76s    Normal    MemberSettingIPPeer    deployment/etcd-operator   member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2"; new peer list https://10.0.0.7:2380
    76s    Normal    MemberUpdate           deployment/etcd-operator   updating member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2" with peers https://10.0.0.7:2380
The `MemberIPLookupFailed` warnings are what trigger the fallback logic (notice the "will attempt a fallback lookup" message). The `MemberSettingIPPeer` and `MemberUpdate` events are signals that the DNS workaround was successful and new IPs were assigned to the members.
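To make the fallback path concrete, here is a rough, hypothetical sketch of the kind of logic those events suggest: try the DNS lookup first and, when it fails, fall back to deriving the peer IP from the member's node object. This is illustrative only, not the actual operator code; the helper name and details are invented.

```go
package main

import (
	"fmt"
	"net"

	corev1 "k8s.io/api/core/v1"
)

// peerIPForMember resolves an etcd member's peer IP. It first tries DNS (the
// pre-4.4 behaviour); if that fails, it falls back to the InternalIP of the
// node the member runs on, which is what the MemberSettingIPPeer events above
// report. memberHost is e.g. "etcd-0.<cluster-domain>"; node is the matching
// master Node object. Hypothetical helper, not the operator's actual code.
func peerIPForMember(memberHost string, node *corev1.Node) (string, error) {
	if ips, err := net.LookupHost(memberHost); err == nil && len(ips) > 0 {
		return ips[0], nil
	}
	// Fallback: use the node's InternalIP instead of the unresolvable DNS name.
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			return addr.Address, nil
		}
	}
	return "", fmt.Errorf("no DNS answer and no InternalIP for node %s", node.Name)
}

func main() {
	// Toy example: a node object like one of the masters in the events above.
	node := &corev1.Node{}
	node.Name = "ci-ln-gk8wrgt-002ac-dmd9h-master-0"
	node.Status.Addresses = []corev1.NodeAddress{{Type: corev1.NodeInternalIP, Address: "10.0.0.6"}}

	ip, err := peerIPForMember("etcd-0.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com", node)
	fmt.Println(ip, err) // falls back to 10.0.0.6 when DNS fails
}
```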
Mangirdas, can your team test this independently? It's easy for us to build a custom 4.4 image containing the fix.

*** Bug 1864436 has been marked as a duplicate of this bug. ***

@Dan, we can try if this is still needed. Is this a new release image or an etcd-operator image? MJ

(In reply to Mangirdas Judeikis from comment #12)

You can use @cluster-bot to build an image from https://github.com/openshift/cluster-etcd-operator/pull/419 to which you can upgrade a cluster. Please feel free to reach out directly if you need any help working through that process.

I'm a bit puzzled. I tested both the old and the new behaviour and the upgrade succeeded, but I was not able to observe the MemberIPLookupFailed or MemberSettingIPPeer events.
The result is positive, but I'm not sure where the events went...
Standard upgrade:
    1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"720dfeac-a1a0-4dd1-bcdf-c37d3669d9cc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberMissingIPPeer' member "etcd-member-mjudeikis-j269c-master-2" is missing an IP in the peer list
    E0814 10:20:10.823407       1 etcdmemberipmigrator.go:314] key failed with : unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host
    I0814 10:20:11.928081       1 request.go:621] Throttling request took 1.181303068s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/installer-2-mjudeikis-j269c-master-2
    I0814 10:20:12.713978       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
    I0814 10:20:12.744864       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"720dfeac-a1a0-4dd1-bcdf-c37d3669d9cc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-2-mjudeikis-j269c-master-2 -n openshift-etcd because it was missing
    I0814 10:20:12.745975       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etc
Events from successful upgrade:
    0s     Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host" to "EtcdMembersControllerDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host"
    90s    Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host" to "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host"
    89s    Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host" to "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host\nClusterMemberControllerDegraded: node lister not synced"
    89s    Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded changed from True to False ("EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: node lister not synced\nEtcdMembersDegraded: No unhealthy members found")
    88s    Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: node lister not synced\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
    88s    Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMembersControllerDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
    88s    Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
    84s    Normal   ConfigMapUpdated        deployment/etcd-operator   Updated ConfigMap/etcd-pod -n openshift-etcd:
    cause by changes in data.pod.yaml,data.version
    84s    Normal   RevisionTriggered       deployment/etcd-operator   new revision 3 triggered by "configmap/etcd-pod has changed"
Did more testing. All good:

    82s   Warning   MemberIPLookupFailed   deployment/etcd-operator   member "mjudeikis-ggpdf-master-1" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-1.iy5kl2c0.v4-westeurope.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.iy5kl2c0.v4-westeurope.osadev.cloud on 172.30.0.10:53: no such host; will attempt a fallback lookup
    82s   Normal    MemberSettingIPPeer    deployment/etcd-operator   member "mjudeikis-ggpdf-master-1"; new peer list https://10.5.0.8:2380

And the result:

    sh-4.2# etcdctl member list
    7c1c361f137ec111, started, mjudeikis-ggpdf-master-2, https://10.5.0.9:2380, https://10.5.0.9:2379
    8099625059b34b7a, started, mjudeikis-ggpdf-master-0, https://10.5.0.7:2380, https://10.5.0.7:2379
    aa431bcae9995c87, started, mjudeikis-ggpdf-master-1, https://10.5.0.8:2380, https://10.5.0.8:2379

This bug doesn't affect 4.5+ upgrades; I cloned https://bugzilla.redhat.com/show_bug.cgi?id=1869681 to track the 4.4.z work.

@dmace @mj This bug moved POST -> MODIFIED -> ON_QA, but I don't see a PR attached for 4.5. Does QE need to verify this for 4.5.z, or should we just CLOSE it? It is currently blocking the merge of https://github.com/openshift/cluster-etcd-operator/pull/419.

The bug only applies to 4.3 -> 4.4 upgrades. There is no bug to fix in the 4.5 release, and so there will be no 4.5 PR. This bug exists only to allow the 4.4 PR to merge and satisfy the overall process.

(In reply to Dan Mace from comment #20)

I mean that the 4.5 bug only exists to satisfy process. This bug, for 4.4, makes sense because the fix is delivered in a PR against the 4.4 branch. Hope that clarifies!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510
Description of problem:

A customer in Azure configures custom hostnames. This results in custom DNS records being created in the Azure private DNS zone. If this hostname clashes with the customer's forwarded DNS zones, the cluster and etcd end up in an unhealthy state. System components should not rely on CoreDNS for zone resolution; they should rely on the system-provided DNS settings on the nodes.

Detailed situation description:

Part 1. Install. A customer with the hostname shared-cluster.osadev.cloud creates a cluster in Azure, providing this domain to the installer. The _etcd-server-ssl._tcp SRV record points to etcd-0.shared-cluster.osadev.cloud; all other records correspond to IP addresses. The customer configures the root DNS zone with records for shared-cluster.osadev.cloud pointing to the child zone. The cluster installs and is in a healthy state.

Part 2. Day 2. The customer wants the custom domain to be forwarded to a custom DNS server and configures CoreDNS with:

    servers:
    - forwardPlugin:
        upstreams:
        - 10.x.y.4
      name: external-dns
      zones:
      - osadev.cloud

Customer pods are healthy and can resolve external hostnames.

Part 3. Upgrade. Once the cluster upgrade is initiated, the etcd upgrade fails with errors:

    EtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.shared-cluster.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.shared-cluster.osadev.cloud on 172.30.0.10:53: no such host

I think system components should not rely on the CoreDNS forward configuration, because it can break existing DNS patterns.

Version-Release number of selected component (if applicable): 4.3+

Actual results: The upgrade fails and is stuck.

Expected results: The upgrade succeeds.

Additional info: I believe all Azure clusters with a custom hostname set will be impacted.
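To illustrate the point that system components should prefer the node's own DNS configuration over the cluster resolver for these records, here is a small comparison sketch. It is illustrative only; the cluster DNS service address and the hostname are the examples used in this report, and actual results depend on where the program runs.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func lookup(r *net.Resolver, name string) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	addrs, err := r.LookupHost(ctx, name)
	fmt.Printf("  %v (err: %v)\n", addrs, err)
}

func main() {
	name := "etcd-0.shared-cluster.osadev.cloud"

	// Resolver pointed at the in-cluster DNS service (CoreDNS). With the
	// forwardPlugin above, queries for osadev.cloud go to the custom upstream,
	// which lacks the private-zone records, so this fails with "no such host".
	clusterDNS := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, network, "172.30.0.10:53")
		},
	}
	fmt.Println("via cluster DNS:")
	lookup(clusterDNS, name)

	// The node's default resolver (Azure-provided DNS at the host level) can
	// see the Azure private zone, so host-networked components resolve the name.
	fmt.Println("via host resolver:")
	lookup(net.DefaultResolver, name)
}
```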