Description of problem:
Creating a machine with a non-existing userData secret reference results in machine forever stuck in Pending state

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-23-022756

How reproducible:
Always

Steps to Reproduce:
1. Create a machine with a nonsense secret reference
2. See the machine continue trying to reconcile and stay in Pending state

Actual results:

NAME                                               PHASE          TYPE        REGION      ZONE         AGE
ci-ln-cvtd672-d5d6b-bswt7-master-0                 Running        m5.xlarge   us-east-2   us-east-2a   37m
ci-ln-cvtd672-d5d6b-bswt7-master-1                 Running        m5.xlarge   us-east-2   us-east-2b   37m
ci-ln-cvtd672-d5d6b-bswt7-master-2                 Running        m5.xlarge   us-east-2   us-east-2a   37m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2a-cg29n  Running        m4.xlarge   us-east-2   us-east-2a   26m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2a-skvqf  Running        m4.xlarge   us-east-2   us-east-2a   26m
ci-ln-cvtd672-d5d6b-bswt7-worker-us-east-2b-k67fv  Running        m4.xlarge   us-east-2   us-east-2b   26m
test-qcmck                                         Provisioning                                        7m38s

Machine:
...
  userDataSecret:
    name: worker-user-data # Was used in 4.5, now is called master-user-data-managed

Logs:

E0923 08:28:52.715694 1 actuator.go:66] test-qcmck error: test-qcmck: reconciler failed to Create machine: failed to get user data: Secret "worker-user-data" not found
W0923 08:28:52.715732 1 controller.go:315] test-qcmck: failed to create machine: test-qcmck: reconciler failed to Create machine: failed to get user data: Secret "worker-user-data" not found
E0923 08:28:52.715779 1 controller.go:237] controller "msg"="Reconciler error" "error"="test-qcmck: reconciler failed to Create machine: failed to get user data: Secret \"worker-user-data\" not found" "controller"="machine_controller" "name"="test-qcmck" "namespace"="openshift-machine-api"
I0923 08:28:52.715836 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="test-qcmck: reconciler failed to Create machine: failed to get user data: Secret \"worker-user-data\" not found" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"test-qcmck","uid":"a24a0a70-78f4-47e3-9662-323bc5784f9e","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"37504"} "reason"="FailedCreate"
I0923 08:28:53.715962 1 controller.go:169] test-qcmck: reconciling Machine
I0923 08:28:53.715998 1 actuator.go:100] test-qcmck: actuator checking if machine exists
I0923 08:28:53.796688 1 reconciler.go:246] test-qcmck: Instance does not exist
I0923 08:28:53.796710 1 controller.go:313] test-qcmck: reconciling machine triggers idempotent create
I0923 08:28:53.796715 1 actuator.go:75] test-qcmck: actuator creating machine
I0923 08:28:53.797157 1 reconciler.go:38] test-qcmck: creating machine

Expected results:
Machine show Failed phase and stop reconciliation

Additional info:
Does this same behaviour present itself in other providers as well? Or do they fail straight away if the user data secret is missing? Do we absolutely need the userData secret to exist before we create the machine, or can it come later? Is this a valid use case?
This isn't a 4.6 blocker, so deferring to 4.7 for further discussion.
Closing this as a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1805639
The fix for the BZ will be generic for all providers. The userData secret may be missing right after cluster install, which should not cause machines to fail at once.

*** This bug has been marked as a duplicate of bug 1805639 ***