Bug 1864677

Summary: AWS install with custom service endpoints fails with machine-config not completing.
Product: OpenShift Container Platform
Reporter: Abhinav Dahiya <adahiya>
Component: Installer
Assignee: aos-install
Installer sub component: openshift-installer
QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:23:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Abhinav Dahiya 2020-08-03 19:43:05 UTC
Description of problem:
When installing on AWS with custom service endpoints, the machine-config operator fails to complete due to a mismatch in the configuration of the control-plane nodes.

```
Unable to apply 4.6.0-0.nightly-2020-08-03-054919: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-222-164.ca-central-1.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\", Node ip-10-0-144-144.ca-central-1.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\", Node ip-10-0-189-142.ca-central-1.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\"
```

The reason seems to be that the MCO on the bootstrap host serves the cloud provider config based on the infrastructure spec, whereas the in-cluster MCO uses the cloud provider config based on the generated ConfigMap in openshift-config-managed/kube-cloud-config.
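
To illustrate why two different cloud provider config inputs would leave nodes pointing at a rendered config the cluster does not have, here is a simplified model. It assumes the `rendered-master-<hash>` name is derived from a hash of the rendered content. The `rendered_name` helper, the choice of MD5, and both sample config strings are hypothetical placeholders, not the MCO's actual implementation, and the report does not state which side carries the service endpoint override.

```python
import hashlib
import json


def rendered_name(pool: str, cloud_conf: str) -> str:
    """Hypothetical stand-in: name a rendered config by hashing its content."""
    payload = json.dumps({"pool": pool, "cloud.conf": cloud_conf}, sort_keys=True)
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return f"rendered-{pool}-{digest}"


# Assumed inputs: two cloud provider configs that differ only in whether
# the custom service endpoint is present. Contents are illustrative.
conf_without_override = "[Global]\n"
conf_with_override = (
    "[Global]\n"
    '[ServiceOverride "0"]\n'
    "Service = sns\n"
    "Region = ca-central-1\n"
    "URL = https://localhost:4567\n"
)

name_a = rendered_name("master", conf_without_override)
name_b = rendered_name("master", conf_with_override)

# Different inputs give different names, so a node told to use one name
# cannot find it if the cluster only rendered the other.
print(name_a != name_b)
```

Under this model, any difference between the config served during bootstrap and the one rendered in-cluster produces a name that the in-cluster controller never created, which is consistent with the "not found" errors above.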

How reproducible:

Steps to Reproduce:
```
$ yq m -CP -x aws-install-config.yaml elide-install-config.yaml
apiVersion: v1
baseDomain: devcluster.openshift.com
controlPlane:
  name: master
  replicas: 3
compute:
- name: worker
  replicas: 3
metadata:
  name: adahiya-2
platform:
  aws:
    region: ca-central-1
    serviceEndpoints:
    - name: sns
      url: https://localhost:4567
pullSecret: ""
sshKey: ""
```
```
➜  $ ./bin/openshift-install --dir dev create cluster
INFO Consuming Install Config from target directory
INFO Credentials loaded from the "default" profile in file "/home/adahiya/.aws/credentials"
WARNING Found override for release image. Please be warned, this is not advised
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s for the Kubernetes API at https://api.adahiya-2.devcluster.openshift.com:6443...
INFO API v4.6.0-202008011154.p0-dirty up
INFO Waiting up to 30m0s for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 30m0s for the cluster at https://api.adahiya-2.devcluster.openshift.com:6443 to initialize...
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-08-03-054919
ERROR Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.6.0-0.nightly-2020-08-03-054919: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-222-164.ca-central-1.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\", Node ip-10-0-144-144.ca-central-1.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\", Node ip-10-0-189-142.ca-central-1.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\"", retrying
INFO Cluster operator machine-config Available is False with : Cluster not available for 4.6.0-0.nightly-2020-08-03-054919
FATAL failed to initialize the cluster: Cluster operator machine-config is still updating
```
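
When triaging a message like the one above, it can help to check whether every degraded node reports the same missing rendered config. The following is a minimal sketch; the `missing_rendered_configs` helper and its regular expression are assumptions written for this message shape, not part of any OpenShift tooling.

```python
import re

# Matches "Node <hostname> is reporting: ... rendered-<pool>-<32 hex chars>".
PATTERN = re.compile(r"Node (\S+) is reporting:.*?(rendered-[a-z]+-[0-9a-f]{32})")


def missing_rendered_configs(message: str) -> dict:
    """Map each reporting node to the rendered config it could not find."""
    return {node: name for node, name in PATTERN.findall(message)}


# Shortened excerpt of the degraded status from the installer output.
SAMPLE = (
    r'"Node ip-10-0-222-164.ca-central-1.compute.internal is reporting: '
    r'\"machineconfig.machineconfiguration.openshift.io '
    r'\\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\", '
    r'Node ip-10-0-144-144.ca-central-1.compute.internal is reporting: '
    r'\"machineconfig.machineconfiguration.openshift.io '
    r'\\\"rendered-master-ddd5e0c6170a64a9525d862a998ad685\\\" not found\""'
)

result = missing_rendered_configs(SAMPLE)
for node, name in result.items():
    print(node, name)
```

If all nodes map to a single name, as they do here, the problem is more likely one mismatched rendered config than a per-node fault.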

Expected results:

The MCO should use the generated ConfigMap on the bootstrap host so that the configuration it serves is correct.

Comment 3 Yunfei Jiang 2020-08-17 10:38:45 UTC
Verified. PASS.
version: 4.6.0-0.nightly-2020-08-16-072105

NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      60m

Comment 5 errata-xmlrpc 2020-10-27 16:23:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196