Bug 1843384 - [upi vsphere] Workers nodes CSRs are not automatically approved
Summary: [upi vsphere] Workers nodes CSRs are not automatically approved
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alexander Demicev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks: 1861773
Reported: 2020-06-03 08:00 UTC by sunzhaohua
Modified: 2020-10-27 16:05 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1861773
Environment:
Last Closed: 2020-10-27 16:04:37 UTC
Target Upstream Version:
agarcial: needinfo+




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 1894 | 0 | None | closed | Bug 1843384: Remove IPI checks for vsphere hostname script and systemd unit | 2021-02-18 02:24:24 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:04:59 UTC

Description sunzhaohua 2020-06-03 08:00:02 UTC
Description of problem:
On upi vsphere, do some configuration to enable machineset; the machine then gets stuck in Provisioned status and its CSRs are not automatically approved.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-02-044312

How reproducible:
Always

Steps to Reproduce:
1. Do some configuration to enable machineset
2. Check machine status
3. Check CSRs

Actual results:
CSRs are not automatically approved and the machine is stuck in Provisioned status.

$ oc get machine
NAME                             PHASE         TYPE   REGION   ZONE   AGE
upg-0602445-762m2-worker-5465w   Provisioned                          4h3m

$ oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-5jpx9   56m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-82nm8   10m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9224h   134m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

I0602 14:16:22.365697       1 main.go:147] CSR csr-kqgqt added
I0602 14:16:22.379214       1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:22.379342       1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost
I0602 14:16:23.659595       1 main.go:147] CSR csr-kqgqt added
I0602 14:16:23.686881       1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:23.686902       1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost
I0602 14:16:26.247082       1 main.go:147] CSR csr-kqgqt added
I0602 14:16:26.261248       1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:26.261273       1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost
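
When the approver retries the same CSR repeatedly, as in the log above, it can help to collapse the output into one entry per CSR and rejection reason. The following is a minimal sketch that assumes the log line layout shown in this report; `count_rejections` and `SAMPLE` are illustrative names, not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Matches approver lines of the form seen in this report, e.g.
# "... main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost"
NOT_AUTHORIZED = re.compile(r"CSR (\S+) not authorized: (.+)$")


def count_rejections(log_text):
    """Return {(csr_name, reason): occurrences} for every 'not authorized' line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NOT_AUTHORIZED.search(line.rstrip())
        if match:
            csr_name, reason = match.groups()
            counts[(csr_name, reason)] += 1
    return dict(counts)


# Log lines copied from the report above.
SAMPLE = """\
I0602 14:16:22.365697       1 main.go:147] CSR csr-kqgqt added
I0602 14:16:22.379214       1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:22.379342       1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost
I0602 14:16:23.659595       1 main.go:147] CSR csr-kqgqt added
I0602 14:16:23.686881       1 main.go:182] CSR csr-kqgqt not authorized: failed to find machine for node localhost
I0602 14:16:23.686902       1 main.go:218] Error syncing csr csr-kqgqt: failed to find machine for node localhost
"""

if __name__ == "__main__":
    for (csr_name, reason), seen in count_rejections(SAMPLE).items():
        print(f"{csr_name}: {reason} (seen {seen}x)")
```

The text to summarize could come from `oc logs` on the machine-approver pod; only the "not authorized" lines are counted, so the "added" and "Error syncing" lines are ignored.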

Expected results:
CSRs are automatically approved.

Additional info:

Comment 1 Alberto 2020-06-03 08:18:43 UTC
Hey sunzhaohua, can you clarify the versions here? OCP 4.4 does not support automated machine management on vSphere; there is no machine controller running.

Comment 2 Alberto 2020-06-03 08:57:51 UTC
Thanks, 4.5 makes sense. It seems instances are wrongly getting "localhost" as their hostname.
These might be relevant:
https://github.com/openshift/machine-config-operator/commit/41d45f0f7b4e6ec53c08ccbd83eefcca9a3e51ad
https://github.com/openshift/machine-api-operator/pull/545
https://github.com/openshift/machine-api-operator/commit/2568c132e64b547e2cfb2001d7112b2dcdb9d7ce

sunzhaohua, could you please share the must-gather logs?

Comment 3 Alberto 2020-06-03 09:01:45 UTC
Also can you please elaborate on "On upi vsphere, do some configuration to enable machineset"?

Comment 5 sunzhaohua 2020-06-04 01:50:09 UTC
I am not quite sure about this bug because it was found in an environment where an upgrade from 4.4.6 to 4.5.0-0.nightly-2020-06-02-044312 had failed, and I do not know whether it is related to the upgrade failure. I can't set up an environment to reproduce it right now because of vSphere resource limitations; once a new 4.5 UPI vSphere environment can be created, I will retest it.
Upgrade failure bug: https://bugzilla.redhat.com/show_bug.cgi?id=1842906
(In reply to Alberto from comment #3)
> Also can you please elaborate on "On upi vsphere, do some configuration to
> enable machineset"?

I modified the machineset's "networkName", "template", and "folder" and added one tag in vCenter so that the machine can be provisioned.
      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 120
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: VM Network
          numCPUs: 2
          numCoresPerSocket: 1
          template: jima02032557-75-6tqzc-rhcos
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: dc1
            datastore: nvme-ds1
            folder: /dc1/vm/upg-0602445
            server: vcsa-qe.vmware.devcluster.openshift.com

Comment 6 sunzhaohua 2020-06-15 10:27:40 UTC
Reproduced this on a new UPI vSphere cluster.
Cluster version: 4.5.0-0.nightly-2020-06-11-183238
Steps:
1. Set up a UPI vSphere cluster.
2. Modify the machineset's "networkName", "template", and "folder" and add one tag in vCenter so that the machine can be provisioned.

    spec:
      metadata: {}
      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 120
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: VM Network
          numCPUs: 2
          numCoresPerSocket: 1
          snapshot: ""
          template: qe-yhui-autodebug-rrnmq-rhcos
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: dc1
            datastore: 10TB-GOLD
            folder: /dc1/vm/huirwang-vsp45-96p77
            server: vcsa2-qe.vmware.devcluster.openshift.com

3. Check machines, CSRs, and logs.

$ oc get machine
NAME                                PHASE         TYPE   REGION   ZONE   AGE
huirwang-vsp45-96p77-worker-wt669   Provisioned                          9m37s

  status:
    addresses:
    - address: 136.144.52.234
      type: InternalIP
    - address: fe80::4106:ea04:a413:42bd
      type: InternalIP
    - address: huirwang-vsp45-96p77-worker-wt669
      type: InternalDNS
    lastUpdated: "2020-06-15T10:18:28Z"
    phase: Provisioned
    providerStatus:
      conditions:
      - lastProbeTime: "2020-06-15T10:16:54Z"
        lastTransitionTime: "2020-06-15T10:16:54Z"
        message: Machine successfully created
        reason: MachineCreationSucceeded
        status: "True"
        type: MachineCreation
      instanceId: 422b0a49-d082-db74-74f0-fde22a9a4f47
      instanceState: poweredOn
      taskRef: task-10291

$ oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-2hkkm   16m     kubernetes.io/kubelet-serving                 system:node:huirwang-vsp45-96p77-rhel-3                                     Pending
csr-4q84d   7m44s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-bn6cg   87s     kubernetes.io/kubelet-serving                 system:node:huirwang-vsp45-96p77-rhel-3                                     Pending

$ oc logs -f machine-approver-66cb75f6b7-rc5f4 -n openshift-cluster-machine-approver -c machine-approver-controller
I0615 10:19:24.730292       1 main.go:147] CSR csr-4q84d added
I0615 10:19:24.741154       1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:19:24.741178       1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:19:45.221349       1 main.go:147] CSR csr-4q84d added
I0615 10:19:45.237049       1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:19:45.237130       1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:20:26.197355       1 main.go:147] CSR csr-4q84d added
I0615 10:20:26.213210       1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:20:26.213235       1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:21:14.514410       1 main.go:147] CSR csr-2hkkm added
I0615 10:21:14.543694       1 csr_check.go:418] retrieving serving cert from huirwang-vsp45-96p77-rhel-3 (136.144.52.241:10250)
I0615 10:21:14.546744       1 csr_check.go:163] Found existing serving cert for huirwang-vsp45-96p77-rhel-3
W0615 10:21:14.546909       1 csr_check.go:172] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
W0615 10:21:14.546927       1 csr_check.go:173] Current SAN Values: [huirwang-vsp45-96p77-rhel-3 136.144.52.241], CSR SAN Values: [huirwang-vsp45-96p77-rhel-3 136.144.52.202 136.144.52.241]
I0615 10:21:14.546939       1 csr_check.go:183] Falling back to machine-api authorization for huirwang-vsp45-96p77-rhel-3
I0615 10:21:14.546953       1 main.go:182] CSR csr-2hkkm not authorized: No target machine for node "huirwang-vsp45-96p77-rhel-3"
I0615 10:21:14.546964       1 main.go:218] Error syncing csr csr-2hkkm: No target machine for node "huirwang-vsp45-96p77-rhel-3"
I0615 10:21:48.133418       1 main.go:147] CSR csr-4q84d added
I0615 10:21:48.149722       1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:21:48.149746       1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
I0615 10:24:31.989933       1 main.go:147] CSR csr-4q84d added
I0615 10:24:32.011902       1 main.go:182] CSR csr-4q84d not authorized: failed to find machine for node localhost
I0615 10:24:32.011930       1 main.go:218] Error syncing csr csr-4q84d: failed to find machine for node localhost
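
One plausible reading of the output above: the bootstrap CSR names the node "localhost", while the machine only advertises "huirwang-vsp45-96p77-worker-wt669" as its InternalDNS address, so no machine can be matched. The sketch below models that lookup under the assumption that matching is done on the InternalDNS address; `find_machine_for_node` and the dictionary layout are hypothetical simplifications, not the approver's actual code, which performs further checks.

```python
def find_machine_for_node(node_name, machines):
    """Return the name of the first machine whose InternalDNS address equals
    node_name, or None when no machine advertises that name.

    Assumption: each machine is a dict with "name" and "addresses" keys that
    mirror the status.addresses block shown in `oc get machine -o yaml`.
    """
    for machine in machines:
        for address in machine.get("addresses", []):
            if address["type"] == "InternalDNS" and address["address"] == node_name:
                return machine["name"]
    return None


# Addresses copied from the machine status in this comment.
MACHINES = [
    {
        "name": "huirwang-vsp45-96p77-worker-wt669",
        "addresses": [
            {"type": "InternalIP", "address": "136.144.52.234"},
            {"type": "InternalIP", "address": "fe80::4106:ea04:a413:42bd"},
            {"type": "InternalDNS", "address": "huirwang-vsp45-96p77-worker-wt669"},
        ],
    }
]

if __name__ == "__main__":
    # Hostname the instance actually reported in its CSR.
    print(find_machine_for_node("localhost", MACHINES))
    # Hostname the instance should have reported.
    print(find_machine_for_node("huirwang-vsp45-96p77-worker-wt669", MACHINES))
```

Under this model the CSR can only be tied to a machine once the instance reports its real hostname instead of "localhost".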

Comment 7 Alberto 2020-06-19 07:55:39 UTC
"failed to find machine for node localhost": this is likely caused by the instance not getting its networking configured properly before it creates the bootstrapping CSR. We'll look into this next sprint.

Comment 8 Alberto 2020-07-01 12:13:29 UTC
I don’t think the value checked here is set in a cluster that was originally created with a UPI process: https://github.com/openshift/machine-config-operator/blob/master/templates/common/vsphere/files/vsphere-hostname.yaml#L9
That would cause the script not to run, so the hostname stays localhost and the machine won’t become a node.
As a workaround, a custom MachineConfig could be generated.
We need to evaluate dropping the check in the script so that it also runs in environments originally created via UPI.
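
The reasoning in this comment can be modelled as a small sketch. It assumes, hypothetically, that the hostname script only takes effect when some IPI-only value is present; `resulting_hostname` and its parameters are illustrative names and do not correspond to real machine-config-operator identifiers.

```python
def resulting_hostname(guestinfo_hostname, ipi_field_set, require_ipi_field):
    """Model the hostname a new vSphere worker ends up with.

    guestinfo_hostname: name the hostname script would apply if it ran.
    ipi_field_set:      whether the (hypothetical) IPI-only value exists in
                        the cluster; assumed False on UPI-created clusters.
    require_ipi_field:  whether the script is gated on that value.
    """
    script_runs = ipi_field_set or not require_ipi_field
    # Without the script the instance keeps the default name "localhost".
    return guestinfo_hostname if script_runs else "localhost"


if __name__ == "__main__":
    name = "zhsun77vsphere-fjzdv-worker-cj6ln"
    # UPI cluster with the check in place.
    print(resulting_hostname(name, ipi_field_set=False, require_ipi_field=True))
    # UPI cluster with the check dropped.
    print(resulting_hostname(name, ipi_field_set=False, require_ipi_field=False))
```

In this model, dropping the check is what lets a UPI-created cluster give new workers their real hostname, which matches the direction of the linked pull request title ("Remove IPI checks for vsphere hostname script and systemd unit").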

Comment 11 sunzhaohua 2020-07-07 08:11:31 UTC
Verified
clusterversion: 4.6.0-0.nightly-2020-07-06-202123
$ oc get machineset
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun77vsphere-fjzdv-worker   1         1         1       1           3h19m

$ oc get machine
NAME                                PHASE          TYPE   REGION   ZONE   AGE
zhsun77vsphere-fjzdv-worker-cj6ln   Running                               88m

$ oc get node
NAME                                STATUS   ROLES    AGE     VERSION
compute-0                           Ready    worker   3h9m    v1.18.3+1a1d81c
control-plane-0                     Ready    master   3h19m   v1.18.3+1a1d81c
control-plane-1                     Ready    master   3h19m   v1.18.3+1a1d81c
control-plane-2                     Ready    master   3h19m   v1.18.3+1a1d81c
zhsun77vsphere-fjzdv-worker-cj6ln   Ready    worker   86m     v1.18.3+1a1d81c

Comment 13 errata-xmlrpc 2020-10-27 16:04:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

