Bug 1864436

Summary: 4.3 -> 4.4 upgrade fails on cluster configured with private DNS forwarding
Product: OpenShift Container Platform
Reporter: Elana Hashman <ehashman>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DUPLICATE
QA Contact: ge liu <geliu>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.3.z
CC: dmace, mjudeiki
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-10 19:05:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Elana Hashman 2020-08-03 19:26:08 UTC
Description of problem: We attempted to upgrade a customer cluster from 4.3.26 to 4.4.10. The upgrade is stuck with the following error:

'Cluster operator etcd is reporting a failure: EtcdMemberIPMigratorDegraded: unable to locate a node for peerURL=https://etcd-1.<cluster DNS name>:****, dnsErrors=lookup _etcd-server-ssl._tcp.<cluster DNS name> on ***.***.***.***:**: no such host'
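For triage, the message quoted above can be split into its parts: the peer URL, the DNS name being looked up, the resolver that answered, and the failure reason. The following is a minimal Python sketch, assuming the message format seen in this one example (the pattern is inferred from the quoted text, not taken from the operator's source). The function names and the `example.test` domain are hypothetical placeholders. The underscore-prefixed service and protocol labels in the lookup name follow the SRV naming convention (RFC 2782).

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the single error message quoted in this report;
# it is an assumption, not the operator's documented format.
_MIGRATOR_ERROR = re.compile(
    r"unable to locate a node for peerURL=(?P<peer_url>\S+), "
    r"dnsErrors=lookup (?P<query>\S+) on (?P<resolver>\S+): (?P<reason>.+)"
)


def parse_migrator_error(message: str) -> Optional[dict]:
    """Return the named parts of the message, or None if it does not match."""
    match = _MIGRATOR_ERROR.search(message)
    return match.groupdict() if match else None


def split_srv_name(query: str) -> Tuple[str, str, str]:
    """Split '_service._proto.domain' into (service, proto, domain)."""
    service, proto, domain = query.split(".", 2)
    return service.lstrip("_"), proto.lstrip("_"), domain


# Placeholder message: 'example.test' stands in for the redacted cluster DNS name.
example = (
    "unable to locate a node for peerURL=https://etcd-1.example.test:2380, "
    "dnsErrors=lookup _etcd-server-ssl._tcp.example.test on 172.30.0.10:53: "
    "no such host"
)

fields = parse_migrator_error(example)
if fields is not None:
    service, proto, domain = split_srv_name(fields["query"])
    print(f"peer URL: {fields['peer_url']}")
    print(f"lookup:   service={service} proto={proto} domain={domain}")
    print(f"resolver: {fields['resolver']}")
    print(f"reason:   {fields['reason']}")
```

The resolver field shows which DNS server returned the failure, which is the part most relevant to a cluster that forwards DNS to a private server.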


Please advise on possible mitigations/fixes.

Note that this is the same customer/cluster with configuration provided in #1861925.


Here is the status of the etcd operator:

apiVersion: operator.openshift.io/v1
items:
- apiVersion: operator.openshift.io/v1
  kind: Etcd
  metadata:
    annotations:
      release.openshift.io/create-only: 'true'
    creationTimestamp: 2020-07-30T21:30:07Z
    generation: 1
    name: cluster
    resourceVersion: '63601688'
    selfLink: /apis/operator.openshift.io/v1/etcds/cluster
    uid: <uuid>
  spec:
    managementState: Managed
  status:
    conditions:
    - lastTransitionTime: 2020-07-30T21:31:30Z
      reason: MembersReported
      status: False
      type: EtcdMembersControllerDegraded
    - lastTransitionTime: 2020-07-30T21:30:14Z
      reason: NoUnsupportedConfigOverrides
      status: True
      type: UnsupportedConfigOverridesUpgradeable
    - lastTransitionTime: 2020-07-30T21:35:56Z
      reason: AsExpected
      status: False
      type: BootstrapTeardownDegraded
    - lastTransitionTime: 2020-07-30T21:31:36Z
      status: False
      type: InstallerControllerDegraded
    - lastTransitionTime: 2020-07-30T21:32:14Z
      message: 3 nodes are active; 3 nodes are at revision 2
      status: True
      type: StaticPodsAvailable
    - lastTransitionTime: 2020-07-30T21:34:07Z
      message: 3 nodes are at revision 2
      reason: AllNodesAtLatestRevision
      status: False
      type: NodeInstallerProgressing
    - lastTransitionTime: 2020-07-30T21:30:15Z
      status: False
      type: NodeInstallerDegraded
    - lastTransitionTime: 2020-07-30T21:34:06Z
      status: False
      type: StaticPodsDegraded
    - lastTransitionTime: 2020-07-30T21:30:34Z
      reason: AsExpected
      status: False
      type: ScriptControllerDegraded
    - lastTransitionTime: 2020-08-03T17:29:12Z
      reason: AsExpected
      status: False
      type: ClusterMemberControllerDegraded
    - lastTransitionTime: 2020-07-30T21:30:16Z
      message: 'unable to locate a node for peerURL=https://etcd-1.<cluster DNS name>:2380, dnsErrors=lookup _etcd-server-ssl._tcp.<cluster DNS name> on 172.30.0.10:53: no such host'
      reason: Error
      status: True
      type: EtcdMemberIPMigratorDegraded
    - lastTransitionTime: 2020-07-30T21:30:17Z
      status: False
      type: ConfigObservationDegraded
    - lastTransitionTime: 2020-07-30T21:30:22Z
      reason: AsExpected
      status: False
      type: EtcdStaticResourcesDegraded
    - lastTransitionTime: 2020-07-30T21:30:24Z
      message: All master nodes are ready
      reason: MasterNodesReady
      status: False
      type: NodeControllerDegraded
    - lastTransitionTime: 2020-07-30T21:39:22Z
      reason: HostEndpoints2Updated
      status: False
      type: HostEndpoints2Degraded
    - lastTransitionTime: 2020-07-30T21:39:22Z
      reason: AsExpected
      status: False
      type: EnvVarControllerDegraded
    - lastTransitionTime: 2020-07-30T21:30:38Z
      status: False
      type: ResourceSyncControllerDegraded
    - lastTransitionTime: 2020-07-30T21:30:39Z
      reason: AsExpected
      status: False
      type: BackingResourceControllerDegraded
    - lastTransitionTime: 2020-07-30T21:30:40Z
      status: False
      type: InstallerPodPendingDegraded
    - lastTransitionTime: 2020-07-30T21:30:40Z
      status: False
      type: InstallerPodContainerWaitingDegraded
    - lastTransitionTime: 2020-07-30T21:30:40Z
      status: False
      type: InstallerPodNetworkingDegraded
    - lastTransitionTime: 2020-07-30T21:30:41Z
      status: False
      type: TargetConfigControllerDegraded
    - lastTransitionTime: 2020-07-30T21:31:35Z
      status: False
      type: RevisionControllerDegraded
    - lastTransitionTime: 2020-07-30T21:31:30Z
      message: etcd-bootstrap member is already removed
      reason: BootstrapAlreadyRemoved
      status: True
      type: EtcdRunningInCluster
    - lastTransitionTime: 2020-07-30T21:31:30Z
      message: No unhealthy members found
      reason: AsExpected
      status: False
      type: EtcdMembersDegraded
    - lastTransitionTime: 2020-07-30T21:31:30Z
      message: No unstarted etcd members found
      reason: AsExpected
      status: False
      type: EtcdMembersProgressing
    - lastTransitionTime: 2020-07-30T21:31:30Z
      message: 3 members are available
      reason: EtcdQuorate
      status: True
      type: EtcdMembersAvailable
    - lastTransitionTime: 2020-07-30T21:31:32Z
      reason: AsExpected
      status: False
      type: EtcdCertSignerControllerDegraded
    latestAvailableRevision: 2
    latestAvailableRevisionReason: ''
    nodeStatuses:
    - currentRevision: 2
      nodeName: aro01-sk6gn-master-0
    - currentRevision: 2
      nodeName: aro01-sk6gn-master-1
    - currentRevision: 2
      nodeName: aro01-sk6gn-master-2
    readyReplicas: 0
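
The condition list above is long, and only one entry signals the failure: EtcdMemberIPMigratorDegraded is the only condition whose type ends in Degraded and whose status is True. A minimal Python sketch of that filter follows, assuming the conditions have already been loaded into a list of dictionaries (for example from a YAML or JSON dump of the resource). The function name and the three-entry sample are illustrative, not part of any OpenShift tooling.

```python
def degraded_conditions(conditions):
    """Return conditions whose type ends in 'Degraded' and whose status is true."""
    result = []
    for cond in conditions:
        is_degraded_type = cond.get("type", "").endswith("Degraded")
        # Depending on how the dump was parsed, status may be a bool
        # or the strings "True"/"False"; accept both.
        is_true = str(cond.get("status")).lower() == "true"
        if is_degraded_type and is_true:
            result.append(cond)
    return result


# Three entries copied from the status above (timestamps and messages omitted).
sample = [
    {"type": "EtcdMembersAvailable", "status": True, "reason": "EtcdQuorate"},
    {"type": "NodeControllerDegraded", "status": False, "reason": "MasterNodesReady"},
    {"type": "EtcdMemberIPMigratorDegraded", "status": True, "reason": "Error"},
]

for cond in degraded_conditions(sample):
    print(cond["type"], cond["reason"])
```

Matching on the type suffix is a convention-based heuristic; conditions such as EtcdMembersAvailable report health through a True status and are deliberately left out.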


Version-Release number of selected component (if applicable): 4.3.26 -> 4.4.10


How reproducible: To reproduce this setup, you would need to configure private DNS forwarding with a VPN.


Expected results: Cluster should successfully upgrade.

Actual results: Cluster upgrade is blocked on etcd-operator upgrade failure.

Comment 1 Elana Hashman 2020-08-05 16:16:26 UTC
This is likely a duplicate of #1865806 (filed later); mitigation is being discussed there.

Comment 2 Dan Mace 2020-08-10 19:05:56 UTC
This issue predates 1865806 but for all practical purposes 1865806 became the canonical tracker, so I'm going to go ahead and close this one as a dupe. Thanks!

*** This bug has been marked as a duplicate of bug 1865806 ***