Bug 1865806
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Etcd upgrade fails with DNS clash | | |
| Product: | OpenShift Container Platform | Reporter: | Mangirdas Judeikis <mjudeiki> |
| Component: | Etcd Operator | Assignee: | Dan Mace <dmace> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Priority: | high |
| Version: | 4.4 | Target Release: | 4.5.z |
| Target Milestone: | --- | Keywords: | Upgrades |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Last Closed: | 2020-09-08 10:54:03 UTC |
| CC: | amcdermo, awestbro, bbennett, dhansen, dmace, ehashman, geliu, jhunter, jmalde, jminter, mgahagan, mifiedle, mjudeiki, mmasters, sdodson, wking | | |
| Bug Blocks: | 1869681 | | |
Description
Mangirdas Judeikis
2020-08-04 08:46:38 UTC
A quick update. During the 4.4 upgrade, OpenShift migrates etcd away from relying on DNS, and in doing so it looks up the etcd DNS records created in the cluster's private DNS zone to find the IPs that replace the DNS addresses. However, the DNS lookup flows through CoreDNS, which in these clusters is configured to forward the requests to an external nameserver that is missing the DNS records present in the private zone.

Transparent immediate fixes include:

1. Configure the forwarded nameserver to delegate to the private zone containing the records.
2. Copy the records from the private zone to the forwarded zone.

Semantically this seems reasonable, because the forwarded nameserver declares authority for the domain and so should be able to answer for those records. Practically speaking, it may not be reasonable for end users to make these changes. We're currently exploring plausible solutions that don't require manual intervention and hope to have an update as soon as possible.

I have a different concern here: what do we need to do to make sure that no other component, starting from 4.4, falls into this pitfall going forward?

Customer configuration:

* Cluster using the custom domain redhat.com (an Azure private DNS zone is created with these records and is served by Azure native DNS at the host level).
* Customer is using a forwardPlugin for redhat.com pointing to 1.1.1.1 (or any other custom DNS).
* Any component running in the cluster and using cluster DNS will potentially be unable to call routes, the API, etc. Such components should either use the .cluster reference or use hostNetwork/DNS.

How would it be best to track this work item? @sdodson, @dmace, your input would be great so I can do what is needed.

(In reply to Mangirdas Judeikis from comment #5)
> I have a different concern here: what do we need to do to make sure that no
> other component, starting from 4.4, falls into this pitfall going forward?
>
> Customer configuration:
> * Cluster using the custom domain redhat.com (an Azure private DNS zone is
>   created with these records and is served by Azure native DNS at the host level).
> * Customer is using a forwardPlugin for redhat.com pointing to 1.1.1.1 (or any
>   other custom DNS).
> * Any component running in the cluster and using cluster DNS will potentially
>   be unable to call routes, the API, etc. Such components should either use the
>   .cluster reference or use hostNetwork/DNS.
>
> How would it be best to track this work item? @sdodson, @dmace, your input
> would be great so I can do what is needed.

CC'ing the Network Edge folks. My first intuition is that Ben's earlier assessment applies here: if the administrator sets up a global forward for a domain that is declared to be managed by OpenShift (e.g. the ingress domain), that upstream nameserver had better actually be authoritative for the domain. Perhaps there's a documentation issue in this regard — the nameserver behind such a rule would need to delegate to the OpenShift-managed zones for which authority is declared, because OpenShift can't manage records in the opaque, user-defined upstream.

At a glance this seems like a fairly reasonable expectation, but it's possible there's more nuance here I'm failing to consider. At the very least I would hope we can use the documentation to warn users about the potential footguns associated with DNS forwarding. I'm not sure whether an alert of some kind would be appropriate; curious to get some feedback from the NE folks on this.

It's a reasonable line of questioning, and it does seem clear there are implications of DNS forwarding that weren't considered when the feature was originally designed — I'm glad the feature is getting used and appreciate the feedback. This Bugzilla might not be the best place to hash out the details, but it's a start; if there's a better discussion venue I'm happy to take the conversation there.
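For context on why the in-cluster lookups never reach the private zone: a `forwardPlugin` rule like the customer's effectively gives CoreDNS a dedicated server block for the cluster domain that sends everything upstream. The fragment below is an illustrative sketch of the general CoreDNS `forward` syntax, not the operator's exact generated Corefile (the zone name is taken from the customer scenario; the real rendered file differs in ports and options):

```
# Illustrative only: a CoreDNS server block that forwards all queries under the
# cluster domain to an external nameserver. With a rule like this in effect,
# the etcd SRV/A records in the cloud provider's private zone are never consulted,
# so lookups for names only present there return "no such host".
redhat.com {
    forward . 1.1.1.1
}
```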
I'd expect the same problem to apply to the API; would it not? Did the customer copy the api-int records to the private zone but not the etcd records? If so, that would imply the customer knew to follow (and possibly invent) some process, and that process had a step to create the api-int records but not the etcd records — so one solution would be to make sure this process is explicitly documented and complete.

If an arbitrary component needs a DNS record in the private zone, it would make sense to have an alert for that component when the name on the record doesn't resolve. The alert could suggest checking the DNS forwarding configuration. In retrospect, it might have been best if we'd prevented overriding name resolution for the cluster domain, but it's too late to change that now.

https://bugzilla.redhat.com/show_bug.cgi?id=1867205 created to update the DNS forwarding docs.

Please note that the scope of this fix is limited to etcd errors during upgrade. If etcd upgrades, the fix works; any follow-on issues should be treated separately unless there's reason to believe they are caused by this fix. To test, I used the following procedures.

### Verify the problem

1. Launch a 4.3.31 IPI cluster on Azure.
2. Edit `dnses.operator.openshift.io/default` with the following `spec` field (the `zones` field should match the cluster domain):

   ```yaml
   servers:
   - forwardPlugin:
       upstreams:
       - 1.1.1.1
     name: external-dns
     zones:
     - ci-ln-2pmxhgk-002ac.ci.azure.devcluster.openshift.com
   ```

3. Upgrade the cluster to a stable 4.4 release:

   ```
   oc adm upgrade --force --allow-upgrade-with-warnings --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release:4.4.16-x86_64
   ```

4. Verify that the upgrade fails because the etcd operator becomes degraded due to DNS lookup failures:

   ```
   $ oc get clusterversion/version
   NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
   version   4.3.31    True        True          10m     Unable to apply 4.4.16: the cluster operator etcd is degraded

   $ oc get clusteroperators/etcd -o yaml
   apiVersion: config.openshift.io/v1
   kind: ClusterOperator
   metadata:
     creationTimestamp: "2020-08-10T14:39:34Z"
     generation: 1
     name: etcd
     resourceVersion: "25429"
     selfLink: /apis/config.openshift.io/v1/clusteroperators/etcd
     uid: 91438145-ab1a-4717-800f-57007cda0a72
   spec: {}
   status:
     conditions:
     - lastTransitionTime: "2020-08-10T14:41:34Z"
       message: 'EtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-1.ci-ln-2pmxhgk-002ac.ci.azure.devcluster.openshift.com:2380, dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-2pmxhgk-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host'
       reason: EtcdMemberIPMigrator_Error
       status: "True"
       type: Degraded
   ```

### Verify the fix

1. Launch a 4.3.31 IPI cluster on Azure.
2. Edit `dnses.operator.openshift.io/default` with the following `spec` field (the `zones` field should match the cluster domain):

   ```yaml
   servers:
   - forwardPlugin:
       upstreams:
       - 1.1.1.1
     name: external-dns
     zones:
     - ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com
   ```

3. Upgrade the cluster to a 4.4 release image containing the fix:

   ```
   oc adm upgrade --force --allow-upgrade-with-warnings --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ci-ln-m17603t/release:latest
   ```

4. Verify that etcd successfully upgrades.
```
$ oc get clusteroperators/etcd -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-08-07T15:27:54Z"
  generation: 1
  name: etcd
  resourceVersion: "28214"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/etcd
  uid: c1ef4c59-4a5a-49fd-85d7-7787f1a3f058
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-08-07T15:32:18Z"
    message: |-
      NodeControllerDegraded: All master nodes are ready
      EtcdMembersDegraded: No unhealthy members found
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-08-07T15:32:42Z"
    message: |-
      NodeInstallerProgressing: 3 nodes are at revision 2
      EtcdMembersProgressing: No unstarted etcd members found
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-08-07T15:29:53Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 2
      EtcdMembersAvailable: 3 members are available
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-08-07T15:27:54Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
```

5. Double-check that the expected events were produced, indicating the new fallback logic was executed:

```
101s  Warning  MemberIPLookupFailed  deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-0" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-0.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com:2380, dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host; will attempt a fallback lookup
101s  Normal   MemberSettingIPPeer   deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-0"; new peer list https://10.0.0.6:2380
101s  Normal   MemberUpdate          deployment/etcd-operator  updating member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-0" with peers https://10.0.0.6:2380
101s  Normal   MemberMissingIPPeer   deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1" is missing an IP in the peer list
88s   Warning  MemberIPLookupFailed  deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-1.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com:2380, dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host; will attempt a fallback lookup
88s   Normal   MemberSettingIPPeer   deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1"; new peer list https://10.0.0.5:2380
88s   Normal   MemberUpdate          deployment/etcd-operator  updating member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-1" with peers https://10.0.0.5:2380
88s   Normal   MemberMissingIPPeer   deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2" is missing an IP in the peer list
76s   Warning  MemberIPLookupFailed  deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-2.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com:2380, dnsErrors=lookup _etcd-server-ssl._tcp.ci-ln-gk8wrgt-002ac.ci.azure.devcluster.openshift.com on 172.30.0.10:53: no such host; will attempt a fallback lookup
76s   Normal   MemberSettingIPPeer   deployment/etcd-operator  member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2"; new peer list https://10.0.0.7:2380
76s   Normal   MemberUpdate          deployment/etcd-operator  updating member "etcd-member-ci-ln-gk8wrgt-002ac-dmd9h-master-2" with peers https://10.0.0.7:2380
```

The `MemberIPLookupFailed` errors are what trigger the fallback logic (notice the "will attempt a fallback lookup" message). The `MemberSettingIPPeer` and `MemberUpdate` events are signals that the DNS workaround succeeded and new IPs were assigned to the members.

Mangirdas, can your team test this independently? It's easy for us to build a custom 4.4 image containing the fix.

*** Bug 1864436 has been marked as a duplicate of this bug. ***

@Dan, we can try if this is still needed. Is this a new release image or an etcd-operator image?

MJ

(In reply to Mangirdas Judeikis from comment #12)
> @Dan, we can try if this is still needed. Is this a new release image or an
> etcd-operator image?

You can use @cluster-bot to build an image from https://github.com/openshift/cluster-etcd-operator/pull/419 to which you can upgrade a cluster. Please feel free to reach out directly if you need any help working through that process.

I'm a bit puzzled. I tested both the old behaviour and the new, and the upgrade succeeded, but I was not able to observe MemberIPLookupFailed or MemberSettingIPPeer. The result is positive, but I'm not sure where the events went...
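The fallback behaviour those events describe can be sketched schematically. The snippet below is *not* the operator's real code (see cluster-etcd-operator PR 419 for that); the function names and the `node_ips_by_host` map are hypothetical stand-ins, and the real operator derives fallback IPs from node status via the Kubernetes API rather than a static map:

```python
# Schematic sketch of the fallback: prefer a DNS lookup for a member's peer
# hostname, and fall back to a node-derived IP when DNS fails. All names here
# are illustrative, not taken from the operator source.

def resolve_member_ip(hostname, dns_lookup, node_ips_by_host):
    """Return the member's IP, preferring DNS and falling back to node data."""
    try:
        return dns_lookup(hostname)  # normal path: the private-zone records resolve
    except LookupError:
        # The forwarded upstream answered "no such host" (the failure mode in
        # this bug), so fall back to the IP recorded for the member's node.
        ip = node_ips_by_host.get(hostname)
        if ip is None:
            raise RuntimeError(f"no fallback IP for {hostname}")
        return ip


def failing_dns(hostname):
    # Simulates CoreDNS forwarding to a nameserver that is missing the
    # private-zone records, which surfaces as "no such host".
    raise LookupError(f"lookup {hostname}: no such host")


node_ips = {"etcd-0.example.invalid": "10.0.0.6"}
ip = resolve_member_ip("etcd-0.example.invalid", failing_dns, node_ips)
print(f"new peer list https://{ip}:2380")  # prints: new peer list https://10.0.0.6:2380
```

When DNS works, the fallback map is never consulted; the events in this report appear only on the failure path.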
Standard upgrade:

```
1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"720dfeac-a1a0-4dd1-bcdf-c37d3669d9cc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberMissingIPPeer' member "etcd-member-mjudeikis-j269c-master-2" is missing an IP in the peer list
E0814 10:20:10.823407       1 etcdmemberipmigrator.go:314] key failed with : unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host
I0814 10:20:11.928081       1 request.go:621] Throttling request took 1.181303068s, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/installer-2-mjudeikis-j269c-master-2
I0814 10:20:12.713978       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etcd-bootstrap
I0814 10:20:12.744864       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"720dfeac-a1a0-4dd1-bcdf-c37d3669d9cc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-2-mjudeikis-j269c-master-2 -n openshift-etcd because it was missing
I0814 10:20:12.745975       1 etcdcli.go:96] service/host-etcd-2 is missing annotation alpha.installer.openshift.io/etc
```

Events from the successful upgrade:

```
0s    Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host" to "EtcdMembersControllerDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host"
90s   Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host" to "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host"
89s   Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host" to "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nEtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-2.xw1rnv4j.v4-eastus.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.xw1rnv4j.v4-eastus.osadev.cloud on 172.30.0.10:53: no such host\nClusterMemberControllerDegraded: node lister not synced"
89s   Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Degraded changed from True to False ("EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: node lister not synced\nEtcdMembersDegraded: No unhealthy members found")
88s   Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nClusterMemberControllerDegraded: node lister not synced\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
88s   Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nBootstrapTeardownDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMembersControllerDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
88s   Normal  OperatorStatusChanged  deployment/etcd-operator  Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: node lister not synced\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
84s   Normal  ConfigMapUpdated       deployment/etcd-operator  Updated ConfigMap/etcd-pod -n openshift-etcd: cause by changes in data.pod.yaml,data.version
84s   Normal  RevisionTriggered      deployment/etcd-operator  new revision 3 triggered by "configmap/etcd-pod has changed"
```

Did more testing.
All good:

```
82s  Warning  MemberIPLookupFailed  deployment/etcd-operator  member "mjudeikis-ggpdf-master-1" IP couldn't be determined via DNS: unable to locate a node for peerURL=https://etcd-1.iy5kl2c0.v4-westeurope.osadev.cloud:2380, dnsErrors=lookup _etcd-server-ssl._tcp.iy5kl2c0.v4-westeurope.osadev.cloud on 172.30.0.10:53: no such host; will attempt a fallback lookup
82s  Normal   MemberSettingIPPeer   deployment/etcd-operator  member "mjudeikis-ggpdf-master-1"; new peer list https://10.5.0.8:2380
```

And the result:

```
sh-4.2# etcdctl member list
7c1c361f137ec111, started, mjudeikis-ggpdf-master-2, https://10.5.0.9:2380, https://10.5.0.9:2379
8099625059b34b7a, started, mjudeikis-ggpdf-master-0, https://10.5.0.7:2380, https://10.5.0.7:2379
aa431bcae9995c87, started, mjudeikis-ggpdf-master-1, https://10.5.0.8:2380, https://10.5.0.8:2379
```

This bug doesn't affect 4.5+ upgrades; I cloned https://bugzilla.redhat.com/show_bug.cgi?id=1869681 to track the 4.4.z work.

@dmace @mj This bug moved POST -> MODIFIED -> ON_QA, but I don't see a PR attached for 4.5. Does QE need to verify this for 4.5.z, or should we just CLOSE it? It is currently blocking the merge of https://github.com/openshift/cluster-etcd-operator/pull/419.

The bug only applies to 4.3 -> 4.4 upgrades. There is no bug to fix in the 4.5 release, and so there will be no 4.5 PR. This bug exists only to allow the 4.4 PR to merge and satisfy the overall process.

(In reply to Dan Mace from comment #20)
> This bug exists only to allow the 4.4 PR to merge and satisfy the overall process.

I mean that the 4.5 bug only exists to satisfy process. This bug, for 4.4, makes sense because the fix is delivered in a PR against the 4.4 branch. Hope that clarifies!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510
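As a closing aside, the migrated end state shown in the `etcdctl member list` output above (IP literals rather than DNS names in the peer URLs) is easy to check mechanically. A small sketch, using the member-list format from this report; the helper name is made up for illustration:

```python
import ipaddress
from urllib.parse import urlparse

# Sample etcdctl output copied from the verification comment above.
member_list = """\
7c1c361f137ec111, started, mjudeikis-ggpdf-master-2, https://10.5.0.9:2380, https://10.5.0.9:2379
8099625059b34b7a, started, mjudeikis-ggpdf-master-0, https://10.5.0.7:2380, https://10.5.0.7:2379
aa431bcae9995c87, started, mjudeikis-ggpdf-master-1, https://10.5.0.8:2380, https://10.5.0.8:2379
"""

def peers_are_ip_based(output):
    """True if every member's peer URL uses an IP literal (the migrated state)."""
    for line in output.strip().splitlines():
        peer_url = line.split(", ")[3]        # fields: ID, status, name, peer URL, client URL
        host = urlparse(peer_url).hostname
        try:
            ipaddress.ip_address(host)        # raises ValueError for DNS names
        except ValueError:
            return False                      # still a DNS name: migration incomplete
    return True

print(peers_are_ip_based(member_list))  # prints: True
```

A `False` result would mean some member still carries a DNS-based peer URL and could be affected by the forwarding misconfiguration described in this bug.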