Bug 1881865 - AWS machine does not fail on missing userData secret
Summary: AWS machine does not fail on missing userData secret
Keywords:
Status: CLOSED DUPLICATE of bug 1805639
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Danil Grigorev
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-23 09:07 UTC by Danil Grigorev
Modified: 2020-11-02 16:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-02 16:14:03 UTC
Target Upstream Version:
Embargoed:


Description Danil Grigorev 2020-09-23 09:07:17 UTC
Description of problem:

Creating a machine that references a non-existent userData secret leaves the machine stuck indefinitely in the Pending state.

Version-Release number of selected component (if applicable):

4.6.0-0.nightly-2020-09-23-022756

How reproducible:

Always

Steps to Reproduce:
1. Create a machine with a nonsense secret reference
2. Observe that the machine keeps trying to reconcile and stays in the Pending state

Actual results:

NAME                                                PHASE          TYPE        REGION      ZONE         AGE
ci-ln-cvtd672-d5d6b-bswt7-master-0                  Running        m5.xlarge   us-east-2   us-east-2a   37m
ci-ln-cvtd672-d5d6b-bswt7-master-1                  Running        m5.xlarge   us-east-2   us-east-2b   37m
ci-ln-cvtd672-d5d6b-bswt7-master-2                  Running        m5.xlarge   us-east-2   us-east-2a   37m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2a-cg29n   Running        m4.xlarge   us-east-2   us-east-2a   26m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2a-skvqf   Running        m4.xlarge   us-east-2   us-east-2a   26m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2b-k67fv   Running        m4.xlarge   us-east-2   us-east-2b   26m
test-qcmck                                          Provisioning                                        7m38s

Machine:
...
      userDataSecret:
        name: worker-user-data # Was used in 4.5, now is called master-user-data-managed

Logs: 

E0923 08:28:52.715694       1 actuator.go:66] test-qcmck error: test-qcmck: reconciler failed to Create machine: failed to get user data: Secret "worker-user-data" not found
W0923 08:28:52.715732       1 controller.go:315] test-qcmck: failed to create machine: test-qcmck: reconciler failed to Create machine: failed to get user data: Secret "worker-user-data" not found
E0923 08:28:52.715779       1 controller.go:237] controller "msg"="Reconciler error" "error"="test-qcmck: reconciler failed to Create machine: failed to get user data: Secret \"worker-user-data\" not found" "controller"="machine_controller" "name"="test-qcmck" "namespace"="openshift-machine-api" 
I0923 08:28:52.715836       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="test-qcmck: reconciler failed to Create machine: failed to get user data: Secret \"worker-user-data\" not found" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"test-qcmck","uid":"a24a0a70-78f4-47e3-9662-323bc5784f9e","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"37504"} "reason"="FailedCreate"
I0923 08:28:53.715962       1 controller.go:169] test-qcmck: reconciling Machine
I0923 08:28:53.715998       1 actuator.go:100] test-qcmck: actuator checking if machine exists
I0923 08:28:53.796688       1 reconciler.go:246] test-qcmck: Instance does not exist
I0923 08:28:53.796710       1 controller.go:313] test-qcmck: reconciling machine triggers idempotent create
I0923 08:28:53.796715       1 actuator.go:75] test-qcmck: actuator creating machine
I0923 08:28:53.797157       1 reconciler.go:38] test-qcmck: creating machine


Expected results:

The machine should show the Failed phase and stop reconciling.

Additional info:

Comment 1 Joel Speed 2020-09-23 09:25:47 UTC
Does this same behaviour present itself in other providers as well? Or do they fail straight away if the user data secret is missing? 

Do we absolutely need the userdatasecret to exist before we create the machine or can it come later? Is this a valid use case?

Comment 2 Joel Speed 2020-09-23 10:09:31 UTC
This isn't a 4.6 blocker, so deferring to 4.7 for further discussion.

Comment 3 Danil Grigorev 2020-11-02 16:14:03 UTC
Closing this as a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1805639. The fix for that BZ will be generic for all providers. The userData secret may be missing right after cluster install, which should not cause machines to fail at once.

*** This bug has been marked as a duplicate of bug 1805639 ***

