Description of problem:
On upi vsphere, do some configuration to enable machineset, machine stuck in Provisioned status, csr are not automatically approved.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-02-044312

How reproducible:
Always

Steps to Reproduce:
1. Do some configuration to enable machineset
2. Check machine status
3. Check csrs

Actual results:
Csrs are not automatically approved, machine stuck in Provisioned status.

$ oc get machine
NAME                             PHASE         TYPE   REGION   ZONE   AGE
upg-0602445-762m2-worker-5465w   Provisioned                          4h3m

$ oc get csr
NAME        AGE    SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-5jpx9   56m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-82nm8   10m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9224h   134m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

I0602 14:16:22.365697 1 main.go:147] CSR csr-kqgqt added
I0602 14:16:22.379214 1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:22.379342 1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost
I0602 14:16:23.659595 1 main.go:147] CSR csr-kqgqt added
I0602 14:16:23.686881 1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:23.686902 1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost
I0602 14:16:26.247082 1 main.go:147] CSR csr-kqgqt added
I0602 14:16:26.261248 1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:26.261273 1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost

Expected results:
CSRs could be automatically approved

Additional info:
Hey sunzhaohua, can you clarify the versions here? OCP 4.4 does not support automated machine management on vSphere; there is no machine controller running.
Thanks, 4.5 makes sense. It seems instances are wrongly getting "localhost" as their hostname. These might be relevant:
https://github.com/openshift/machine-config-operator/commit/41d45f0f7b4e6ec53c08ccbd83eefcca9a3e51ad
https://github.com/openshift/machine-api-operator/pull/545
https://github.com/openshift/machine-api-operator/commit/2568c132e64b547e2cfb2001d7112b2dcdb9d7ce

sunzhaohua, could you please share must-gather logs?
Also can you please elaborate on "On upi vsphere, do some configuration to enable machineset"?
http://file.rdu.redhat.com/~zhsun/must-gather.local.1975743706142649508.tar.gz
I am not sure about this bug, because it was found in an environment where an upgrade had failed (4.4.6 -> 4.5.0-0.nightly-2020-06-02-044312), and I do not know whether it is related to the upgrade failure. I can't set up an environment to reproduce it right now because of vSphere resource limitations; once a new 4.5 UPI vSphere environment can be created, I will retest it.

Upgrade failure bug: https://bugzilla.redhat.com/show_bug.cgi?id=1842906

(In reply to Alberto from comment #3)
> Also can you please elaborate on "On upi vsphere, do some configuration to
> enable machineset"?

I modified the machineset's "networkName", "template" and "folder", and added one tag in the vCenter, so that the machine can be provisioned.

      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 120
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: VM Network
          numCPUs: 2
          numCoresPerSocket: 1
          template: jima02032557-75-6tqzc-rhcos
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: dc1
            datastore: nvme-ds1
            folder: /dc1/vm/upg-0602445
            server: vcsa-qe.vmware.devcluster.openshift.com
Reproduced this on a new UPI vSphere cluster.

cluster version: 4.5.0-0.nightly-2020-06-11-183238

steps:
1. Set up a UPI vSphere cluster.
2. Modify the machineset's "networkName", "template" and "folder", and add one tag in the vCenter, so that the machine can be provisioned.

    spec:
      metadata: {}
      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 120
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: VM Network
          numCPUs: 2
          numCoresPerSocket: 1
          snapshot: ""
          template: qe-yhui-autodebug-rrnmq-rhcos
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: dc1
            datastore: 10TB-GOLD
            folder: /dc1/vm/huirwang-vsp45-96p77
            server: vcsa2-qe.vmware.devcluster.openshift.com

3. Check machines, CSRs and logs.

$ oc get machine
NAME                                PHASE         TYPE   REGION   ZONE   AGE
huirwang-vsp45-96p77-worker-wt669   Provisioned                          9m37s

status:
  addresses:
  - address: 136.144.52.234
    type: InternalIP
  - address: fe80::4106:ea04:a413:42bd
    type: InternalIP
  - address: huirwang-vsp45-96p77-worker-wt669
    type: InternalDNS
  lastUpdated: "2020-06-15T10:18:28Z"
  phase: Provisioned
  providerStatus:
    conditions:
    - lastProbeTime: "2020-06-15T10:16:54Z"
      lastTransitionTime: "2020-06-15T10:16:54Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: 422b0a49-d082-db74-74f0-fde22a9a4f47
    instanceState: poweredOn
    taskRef: task-10291

$ oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-2hkkm   16m     kubernetes.io/kubelet-serving                 system:node:huirwang-vsp45-96p77-rhel-3                                     Pending
csr-4q84d   7m44s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bn6cg   87s     kubernetes.io/kubelet-serving                 system:node:huirwang-vsp45-96p77-rhel-3                                     Pending

$ oc logs -f machine-approver-66cb75f6b7-rc5f4 -n openshift-cluster-machine-approver -c machine-approver-controller
I0615 10:19:24.730292 1 main.go:147] CSR csr-4q84d added
I0615 10:19:24.741154 1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:19:24.741178 1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:19:45.221349 1 main.go:147] CSR csr-4q84d added
I0615 10:19:45.237049 1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:19:45.237130 1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:20:26.197355 1 main.go:147] CSR csr-4q84d added
I0615 10:20:26.213210 1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:20:26.213235 1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:21:14.514410 1 main.go:147] CSR csr-2hkkm added
I0615 10:21:14.543694 1 csr_check.go:418] retrieving serving cert from huirwang-vsp45-96p77-rhel-3 (136.144.52.241:10250)
I0615 10:21:14.546744 1 csr_check.go:163] Found existing serving cert for huirwang-vsp45-96p77-rhel-3
W0615 10:21:14.546909 1 csr_check.go:172] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
W0615 10:21:14.546927 1 csr_check.go:173] Current SAN Values: [huirwang-vsp45-96p77-rhel-3 136.144.52.241], CSR SAN Values: [huirwang-vsp45-96p77-rhel-3 136.144.52.202 136.144.52.241]
I0615 10:21:14.546939 1 csr_check.go:183] Falling back to machine-api authorization for huirwang-vsp45-96p77-rhel-3
I0615 10:21:14.546953 1 main.go:182] CSR csr-2hkkm not authorized: No target machine for node "huirwang-vsp45-96p77-rhel-3"
I0615 10:21:14.546964 1 main.go:218] Error syncing csr csr-2hkkm: No target machine for node "huirwang-vsp45-96p77-rhel-3"
I0615 10:21:48.133418 1 main.go:147] CSR csr-4q84d added
I0615 10:21:48.149722 1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:21:48.149746 1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:24:31.989933 1 main.go:147] CSR csr-4q84d added
I0615 10:24:32.011902 1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:24:32.011930 1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
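Side note on the csr-2hkkm entries in the log: the approver reports that the SAN values in the new CSR do not match the existing serving certificate, and then falls back to machine-api authorization. The snippet below is only an illustrative sketch of such a comparison, using the two SAN lists copied from the log. The function name san_values_match is hypothetical, and the assumption that the check is a plain equality of the two lists is mine, not something the log confirms.

```python
def san_values_match(current, requested):
    """Return True when both lists hold exactly the same SAN values.

    Assumption: the renewal check is treated here as an order-insensitive
    equality test; the real approver may compare differently.
    """
    return sorted(current) == sorted(requested)


# SAN values copied from the approver log for huirwang-vsp45-96p77-rhel-3.
current_sans = ["huirwang-vsp45-96p77-rhel-3", "136.144.52.241"]
csr_sans = ["huirwang-vsp45-96p77-rhel-3", "136.144.52.202", "136.144.52.241"]

# Values requested in the CSR that the current certificate does not carry.
extra_in_csr = sorted(set(csr_sans) - set(current_sans))

print(san_values_match(current_sans, csr_sans))
print(extra_in_csr)
```

Under that assumption, the extra address 136.144.52.202 in the CSR is what makes the two lists differ.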
"failed to find machine for node localhost" This is likely a problem with the instance not getting its networking configured properly before it creates the bootstrapping CSR. We'll look into this next sprint.
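To illustrate why a "localhost" hostname blocks approval, here is a simplified model of matching a CSR's node name against machine addresses. It is a sketch, not the machine-approver's actual code: the Machine class and all function names are hypothetical, the "system:node:" common-name prefix is the usual Kubernetes convention for kubelet client certificates (the CSR subject itself is not shown in this bug), and the assumption that matching is done on the InternalDNS address is mine. The machine name and addresses are taken from the machine status reported above.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical stand-in for a Machine object; only the fields used here."""
    name: str
    # (type, address) pairs, mirroring status.addresses on the Machine.
    addresses: list = field(default_factory=list)


def node_name_from_common_name(common_name):
    """Strip the 'system:node:' prefix from a kubelet CSR common name."""
    prefix = "system:node:"
    if not common_name.startswith(prefix):
        raise ValueError(f"unexpected common name: {common_name}")
    return common_name[len(prefix):]


def find_machine_by_internal_dns(node_name, machines):
    """Return the first machine whose InternalDNS address equals node_name."""
    for machine in machines:
        for addr_type, address in machine.addresses:
            if addr_type == "InternalDNS" and address == node_name:
                return machine
    return None


def authorize(common_name, machines):
    """Return a short result string instead of approving a real CSR."""
    node_name = node_name_from_common_name(common_name)
    machine = find_machine_by_internal_dns(node_name, machines)
    if machine is None:
        return f"failed to find machine for node {node_name}"
    return f"matched machine {machine.name}"


machines = [
    Machine(
        name="huirwang-vsp45-96p77-worker-wt669",
        addresses=[
            ("InternalIP", "136.144.52.234"),
            ("InternalDNS", "huirwang-vsp45-96p77-worker-wt669"),
        ],
    )
]

# The instance booted with hostname "localhost", so nothing matches.
print(authorize("system:node:localhost", machines))
# With the expected hostname the same machine would be found.
print(authorize("system:node:huirwang-vsp45-96p77-worker-wt669", machines))
```

In this model the machine exists and reports the right InternalDNS name, so the only thing preventing a match is the node name carried by the CSR.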
I don’t think the value checked here is set in a cluster that was originally created with a UPI process:
https://github.com/openshift/machine-config-operator/blob/master/templates/common/vsphere/files/vsphere-hostname.yaml#L9

That would cause the script not to run. The hostname will therefore be localhost, and the machine won’t become a node. As a workaround, a custom MachineConfig could be generated. We need to evaluate dropping the check in the script so that it also runs in environments originally created via UPI.
Verified
clusterversion: 4.6.0-0.nightly-2020-07-06-202123

$ oc get machineset
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun77vsphere-fjzdv-worker   1         1         1       1           3h19m

$ oc get machine
NAME                                PHASE     TYPE   REGION   ZONE   AGE
zhsun77vsphere-fjzdv-worker-cj6ln   Running                          88m

$ oc get node
NAME                                STATUS   ROLES    AGE     VERSION
compute-0                           Ready    worker   3h9m    v1.18.3+1a1d81c
control-plane-0                     Ready    master   3h19m   v1.18.3+1a1d81c
control-plane-1                     Ready    master   3h19m   v1.18.3+1a1d81c
control-plane-2                     Ready    master   3h19m   v1.18.3+1a1d81c
zhsun77vsphere-fjzdv-worker-cj6ln   Ready    worker   86m     v1.18.3+1a1d81c
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196