Bug 1776767

Summary: Machine Config Operator goes into a degraded state and complains that a rendered-master machineconfig is not found when proxy is enabled
Product: OpenShift Container Platform Reporter: Johnny Liu <jialiu>
Component: Machine Config Operator Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED DUPLICATE QA Contact: Michael Nguyen <mnguyen>
Severity: high Docs Contact:
Priority: high    
Version: 4.2.z Keywords: Regression, TestBlocker
Target Milestone: ---   
Target Release: 4.2.z   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-11-26 11:50:54 UTC Type: Bug

Description Johnny Liu 2019-11-26 10:48:02 UTC
Description of problem:


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-11-24-111327

How reproducible:
Always

Steps to Reproduce:
1. Remove internet access from the private subnets of the VPC.
2. Launch a proxy server in the public subnets of the same VPC.
3. Trigger a UPI install on AWS with the proxy enabled.

Actual results:
Installation failed.
$ ./openshift-install wait-for install-complete --dir '/home/installer2/workspace/Launch Environment Flexy/workdir/install-dir'
level=info msg="Waiting up to 30m0s for the cluster at https://api.jialiu425.qe.devcluster.openshift.com:6443 to initialize..."

level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 4.2.0-0.nightly-2019-11-24-111327 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-72-7.us-east-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\\\\\" not found\\\", Node ip-10-0-61-75.us-east-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\\\\\" not found\\\", Node ip-10-0-56-33.us-east-2.compute.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\\\\\" not found\\\"\", retrying"

After the installation failed, checking the cluster operators shows that only machine-config is in a degraded state.

# oc describe co machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-11-26T09:42:04Z
  Generation:          1
  Resource Version:    29724
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 00c49323-1031-11ea-9aab-02a0248741b0
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-11-26T09:42:04Z
    Message:               Cluster not available for 4.2.0-0.nightly-2019-11-24-111327
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-11-26T09:42:04Z
    Message:               Cluster is bootstrapping 4.2.0-0.nightly-2019-11-24-111327
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-11-26T09:52:30Z
    Message:               Failed to resync 4.2.0-0.nightly-2019-11-24-111327 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-72-7.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\" not found\", Node ip-10-0-61-75.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\" not found\", Node ip-10-0-56-33.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\" not found\"", retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-11-26T09:52:30Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Last Sync Error:  pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node ip-10-0-72-7.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\" not found\", Node ip-10-0-61-75.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\" not found\", Node ip-10-0-56-33.us-east-2.compute.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-a3be962e3ac25c82a501d894dc950be5\\\" not found\"", retrying
    Worker:           all 2 nodes are at latest configuration rendered-worker-711e924795d1c0192f461a1c551f621f
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.2.0-0.nightly-2019-11-24-111327
Events:       <none>


Expected results:
Installation should succeed.


Additional info:
1. Released version 4.2.8 + proxy: passed.
2. Nightly build 4.2.0-0.nightly-2019-11-24-111327 + proxy: failed.
3. Nightly build 4.2.0-0.nightly-2019-11-24-111327 + no proxy: passed.
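
The failure message under "Actual results" repeats the same rendered config name for each node, wrapped in several layers of escaping. As a rough sketch for triage, the following shell helpers pull the rendered machineconfig names and the node names out of such a message. The function names are hypothetical and not part of any OpenShift tooling, and the pattern assumes rendered names have the form rendered-<pool>-<32 hex characters>, as seen in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct rendered MachineConfig names that
# appear in an MCO status message read from stdin.
# Assumes names look like rendered-<pool>-<32 hex chars>, as in this report.
rendered_configs_in_message() {
    grep -oE 'rendered-[a-z]+-[0-9a-f]{32}' | sort -u
}

# Hypothetical helper: print the distinct node names that the message lists
# as "Node <name> is reporting".
degraded_nodes() {
    grep -oE 'Node [^ ]+ is reporting' | awk '{print $2}' | sort -u
}

# Example with a shortened copy of the message from this report.
sample='Node ip-10-0-72-7.us-east-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-a3be962e3ac25c82a501d894dc950be5\" not found"'
printf '%s\n' "$sample" | rendered_configs_in_message
printf '%s\n' "$sample" | degraded_nodes
```

On a live cluster, the extracted name could then be compared against the output of `oc get machineconfig` to confirm whether the rendered config is really missing.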

Comment 3 Antonio Murdaca 2019-11-26 11:20:12 UTC
The MCO didn't change between the nightlies that you mentioned:

12:19:01 [~] oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-11-24-111327 | grep machine-config-operator
  machine-config-operator                       https://github.com/openshift/machine-config-operator                       d780d197a9c5848ba786982c0c4aaa7487297046
12:19:13 [~] oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-11-24-111327 | grep machine-config-operator
  machine-config-operator                       https://github.com/openshift/machine-config-operator                       d780d197a9c5848ba786982c0c4aaa7487297046

Comment 4 Antonio Murdaca 2019-11-26 11:20:38 UTC
(In reply to Antonio Murdaca from comment #3)
> The MCO didn't change between the nightlies that you mentioned:
> 
> 12:19:01 [~] oc adm release info --commits
> registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-11-24-111327
> | grep machine-config-operator
>   machine-config-operator                      
> https://github.com/openshift/machine-config-operator                      
> d780d197a9c5848ba786982c0c4aaa7487297046
> 12:19:13 [~] oc adm release info --commits
> registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-11-24-111327
> | grep machine-config-operator
>   machine-config-operator                      
> https://github.com/openshift/machine-config-operator                      
> d780d197a9c5848ba786982c0c4aaa7487297046

ok, forget that, didn't read "no proxy"

Comment 6 Johnny Liu 2019-11-26 11:26:06 UTC
Sounds like a side effect introduced by Bug 1770223.


In the fix for 1770223, api.jialiu425.qe.devcluster.openshift.com is removed; inside the cluster, api-int should be used instead of api.
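
To illustrate the api versus api-int distinction, here is a minimal sketch that rewrites an external API hostname to its internal counterpart. The helper name is hypothetical, and it assumes the internal endpoint is named api-int.<cluster domain>, mirroring the api.<cluster domain> hostname shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: map api.<cluster domain> to api-int.<cluster domain>.
# Hostnames that do not start with "api." are printed unchanged.
to_internal_api_host() {
    printf '%s\n' "$1" | sed 's/^api\./api-int./'
}

# Example using the cluster hostname from this report.
to_internal_api_host api.jialiu425.qe.devcluster.openshift.com
```

This only shows the naming relationship; whether a given component actually uses the internal endpoint has to be checked in its own configuration.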

Comment 7 Antonio Murdaca 2019-11-26 11:27:57 UTC
(In reply to Johnny Liu from comment #6)
> Sounds like a side effect introduced by Bug 1770223.
> 
> 
> In the fix for 1770223, api.jialiu425.qe.devcluster.openshift.com is removed;
> inside the cluster, api-int should be used instead of api.

likely, yeah