Bug 1881865

Summary: AWS machine does not fail on missing userData secret
Product: OpenShift Container Platform
Reporter: Danil Grigorev <dgrigore>
Component: Cloud Compute
Assignee: Danil Grigorev <dgrigore>
Cloud Compute sub component: Other Providers
QA Contact: Milind Yadav <miyadav>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: low
Priority: unspecified
CC: zhsun
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-02 16:14:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Danil Grigorev 2020-09-23 09:07:17 UTC
Description of problem:

Creating a machine whose userData secret reference points to a non-existent secret leaves the machine stuck in the Pending state forever.

Version-Release number of selected component (if applicable):

4.6.0-0.nightly-2020-09-23-022756

How reproducible:

Always

Steps to Reproduce:
1. Create a machine whose userDataSecret references a secret that does not exist
2. Observe that the machine keeps being reconciled and stays in the Pending state

Actual results:

NAME                                                PHASE          TYPE        REGION      ZONE         AGE
ci-ln-cvtd672-d5d6b-bswt7-master-0                  Running        m5.xlarge   us-east-2   us-east-2a   37m
ci-ln-cvtd672-d5d6b-bswt7-master-1                  Running        m5.xlarge   us-east-2   us-east-2b   37m
ci-ln-cvtd672-d5d6b-bswt7-master-2                  Running        m5.xlarge   us-east-2   us-east-2a   37m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2a-cg29n   Running        m4.xlarge   us-east-2   us-east-2a   26m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2a-skvqf   Running        m4.xlarge   us-east-2   us-east-2a   26m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2b-k67fv   Running        m4.xlarge   us-east-2   us-east-2b   26m
test-qcmck                                          Provisioning                                        7m38s

Machine:
...
      userDataSecret:
        name: worker-user-data # Was used in 4.5, now is called master-user-data-managed

Logs: 

E0923 08:28:52.715694       1 actuator.go:66] test-qcmck error: test-qcmck: reconciler failed to Create machine: failed to get user data: Secret "worker-user-data" not found
W0923 08:28:52.715732       1 controller.go:315] test-qcmck: failed to create machine: test-qcmck: reconciler failed to Create machine: failed to get user data: Secret "worker-user-data" not found
E0923 08:28:52.715779       1 controller.go:237] controller "msg"="Reconciler error" "error"="test-qcmck: reconciler failed to Create machine: failed to get user data: Secret \"worker-user-data\" not found" "controller"="machine_controller" "name"="test-qcmck" "namespace"="openshift-machine-api" 
I0923 08:28:52.715836       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="test-qcmck: reconciler failed to Create machine: failed to get user data: Secret \"worker-user-data\" not found" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"test-qcmck","uid":"a24a0a70-78f4-47e3-9662-323bc5784f9e","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"37504"} "reason"="FailedCreate"
I0923 08:28:53.715962       1 controller.go:169] test-qcmck: reconciling Machine
I0923 08:28:53.715998       1 actuator.go:100] test-qcmck: actuator checking if machine exists
I0923 08:28:53.796688       1 reconciler.go:246] test-qcmck: Instance does not exist
I0923 08:28:53.796710       1 controller.go:313] test-qcmck: reconciling machine triggers idempotent create
I0923 08:28:53.796715       1 actuator.go:75] test-qcmck: actuator creating machine
I0923 08:28:53.797157       1 reconciler.go:38] test-qcmck: creating machine


Expected results:

The machine shows the Failed phase and reconciliation stops.
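
For illustration, the expected behaviour could be sketched in Go roughly as follows. This is a standalone sketch, not the machine-api code: `TerminalError`, `ErrSecretNotFound`, `getUserData` and `nextPhase` are hypothetical names, the secret store is simulated with a map, and the phase names are only illustrative. The idea is that a missing secret is classified as an error that retrying cannot fix, so the machine moves to Failed and is not requeued.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is an illustrative stand-in for the machine phase.
type Phase string

const (
	PhaseProvisioning Phase = "Provisioning"
	PhaseProvisioned  Phase = "Provisioned"
	PhaseFailed       Phase = "Failed"
)

// TerminalError is a hypothetical marker for errors that retrying cannot fix.
type TerminalError struct{ Err error }

func (e *TerminalError) Error() string { return e.Err.Error() }
func (e *TerminalError) Unwrap() error { return e.Err }

// ErrSecretNotFound is a hypothetical sentinel standing in for the
// `Secret "worker-user-data" not found` error seen in the logs.
var ErrSecretNotFound = errors.New("secret not found")

// getUserData simulates the secret lookup; a missing secret is
// reported as a terminal error.
func getUserData(secrets map[string][]byte, name string) ([]byte, error) {
	data, ok := secrets[name]
	if !ok {
		return nil, &TerminalError{
			Err: fmt.Errorf("failed to get user data: %q: %w", name, ErrSecretNotFound),
		}
	}
	return data, nil
}

// nextPhase maps the result of a create attempt to a phase and
// reports whether the machine should be requeued.
func nextPhase(err error) (Phase, bool) {
	if err == nil {
		return PhaseProvisioned, false
	}
	var terminal *TerminalError
	if errors.As(err, &terminal) {
		// Terminal: surface the failure and stop reconciling.
		return PhaseFailed, false
	}
	// Anything else is treated as transient and retried.
	return PhaseProvisioning, true
}

func main() {
	_, err := getUserData(map[string][]byte{}, "worker-user-data")
	phase, requeue := nextPhase(err)
	fmt.Println(phase, requeue, err)
}
```

Keeping the terminal/transient decision in one place (`nextPhase` here) means other errors, such as a temporary cloud API failure, would still be retried as before.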

Additional info:

Comment 1 Joel Speed 2020-09-23 09:25:47 UTC
Does this same behaviour present itself in other providers as well? Or do they fail straight away if the user data secret is missing? 

Do we absolutely need the userData secret to exist before we create the machine, or can it come later? Is this a valid use case?

Comment 2 Joel Speed 2020-09-23 10:09:31 UTC
This isn't a 4.6 blocker, so deferring to 4.7 for further discussion.

Comment 3 Danil Grigorev 2020-11-02 16:14:03 UTC
Closing this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1805639. The fix for that BZ will be generic for all providers. The userData secret may be missing right after cluster install, which should not cause machines to fail at once.

*** This bug has been marked as a duplicate of bug 1805639 ***
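
One possible way to reconcile the two requirements from this thread (fail eventually, but not at once) is a grace period, sketched below in Go. This is only an assumption for illustration and not the fix tracked in bug 1805639: `secretGracePeriod`, `Decision` and `decideOnMissingSecret` are hypothetical names, and the 10 minute window is an arbitrary value that the report does not specify.

```go
package main

import (
	"fmt"
	"time"
)

// secretGracePeriod is a hypothetical tolerance window for a missing
// userData secret; the value is arbitrary.
const secretGracePeriod = 10 * time.Minute

// Decision says what to do when the userData secret is missing.
type Decision string

const (
	DecisionRetry Decision = "retry"
	DecisionFail  Decision = "fail"
)

// decideOnMissingSecret keeps retrying while the machine is younger
// than the grace period and fails it afterwards.
func decideOnMissingSecret(age, grace time.Duration) Decision {
	if age < grace {
		return DecisionRetry
	}
	return DecisionFail
}

func main() {
	// Example ages: a machine a few minutes old, and a much older one.
	ages := []time.Duration{
		7*time.Minute + 38*time.Second,
		30 * time.Minute,
	}
	for _, age := range ages {
		fmt.Printf("age=%s decision=%s\n", age, decideOnMissingSecret(age, secretGracePeriod))
	}
}
```

With such a window, a secret that shows up shortly after cluster install would still be picked up, while a reference that never resolves would eventually move the machine to Failed.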