Description of problem:
EC2 instances of type m5.metal are not automatically approved by the node auto-approver.

Version-Release number of selected component (if applicable):
4.1.2

How reproducible:
Always

Steps to Reproduce:
1. Install a new cluster
2. Add a new machineset with instance type m5.metal to the cluster
3. Wait for nodes to join

Actual results:
New instances never join the cluster

Expected results:
A new node is added to the cluster

Additional info:
Manually approving the CSR allows the new node to join.
The following is the pertinent output from the logs of cluster-machine-approver:

```
I0625 18:33:27.288077 1 main.go:132] CSR csr-9tkc6 not authorized: CSR csr-9tkc6 creation time 2019-06-25 18:33:27 +0000 UTC not in range (2019-06-25 18:14:44 +0000 UTC, 2019-06-25 18:24:54 +0000 UTC)
```
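For reference, that message comes from a check that compares the CSR's creation time against a window derived from the corresponding Machine. The sketch below is a hypothetical illustration of that shape in Go, not the actual cluster-machine-approver code; the ten-minute window length is only inferred from the two bounds in the log above.

```
// Hypothetical illustration only (not the actual cluster-machine-approver
// source): the log above implies a check of roughly this shape, where a CSR
// is auto-approved only if its creation time falls inside a window that
// starts at the Machine's creation time.
package approver

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// maxPendingWindow is an assumed constant; the bounds in the log above are
// roughly ten minutes apart.
const maxPendingWindow = 10 * time.Minute

// csrInApprovalWindow returns an error shaped like the "creation time ...
// not in range (start, end)" log message when the CSR falls outside the window.
func csrInApprovalWindow(csrName string, machineCreated, csrCreated metav1.Time) error {
	start := machineCreated.Time
	end := start.Add(maxPendingWindow)
	if csrCreated.Time.Before(start) || csrCreated.Time.After(end) {
		return fmt.Errorf("CSR %s creation time %v not in range (%v, %v)",
			csrName, csrCreated.Time, start, end)
	}
	return nil
}
```

With the timestamps in the log, the CSR created at 18:33:27 lands well past the 18:24:54 upper bound, which is why auto-approval was refused.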
Reassigning to the RHCOS team. We should ensure we are syncing the machine's time.
Hmm. I think we need to wait for time synchronization to occur in the initial pivot, and potentially in the second boot. See also https://github.com/openshift/machine-config-operator/issues/629

In this case, it would make sense to me for the workers to ask the masters for time sync even before they've joined the cluster, so the CSR is valid, but also after.
Caesar, does this reproduce 100% of the time (pun intended)?
So far from what we've seen, yes. At least with a few cluster installs (us-east-1 and us-west-1, I believe). No instance with type m5.metal has been able to join the cluster. This is only in the past few days though, so it may be something that AWS resolves eventually.
Updating to 4.1.z since this was found in 4.1.2.
One important thing here is that on !metal nodes on AWS, we use a paravirtualized clock source very early in boot; you'll see this in dmesg:

```
[ 0.960025] clocksource: Switched to clocksource xen
```

Similarly on e.g. KVM (booting via cosa run) I see:

```
[ 0.564183] clocksource: Switched to clocksource kvm-clock
```

But this isn't true on metal.
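As a side note, the clocksource that dmesg reports can also be checked at runtime through sysfs. The following is a small standalone sketch (assuming the standard Linux sysfs layout, and not tied to any OpenShift component) that prints the current and available clocksources, which makes it easy to confirm whether a node is on xen, kvm-clock, or a hardware source such as tsc on metal.

```
// Print the kernel's current and available clocksources from sysfs.
// Standalone diagnostic sketch; assumes a standard Linux sysfs layout.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const base = "/sys/devices/system/clocksource/clocksource0/"
	for _, f := range []string{"current_clocksource", "available_clocksource"} {
		b, err := os.ReadFile(base + f)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %s\n", f, strings.TrimSpace(string(b)))
	}
}
```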
I am not sure that this is an issue with clock skew. The timestamps being compared are the creation times on the Machine resource and the CSR. Both of those timestamps should be set by the API server, not by the machine being built. Is it possible that it is taking more than 10 minutes between when the Machine resource is created and when the CSR is created?
> Both of those timestamps should be set by the API server, not by the machine being built

Ah, OK; I jumped to the conclusion around node time skew.

> I am not sure that this is an issue with clock skew. The timestamps being compared are the creation times on the Machine resource and the CSR. Both of those timestamps should be set by the API server, not by the machine being built. Is it possible that it is taking more than 10 minutes between when the Machine resource is created and when the CSR is created?

For metal instances it also seems highly likely to me that provisioning the machine takes much longer than for VMs. I bet that metal instances are basically always occupied by "spot" work and provisioning them needs to wait for the spot instances to be evicted.
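One way to check that hypothesis on a live cluster is to compare the API-server-assigned creationTimestamps of the Machine and its CSR directly. The snippet below is a hypothetical diagnostic, not part of any shipped component; it assumes a recent client-go (context-aware signatures, certificates.k8s.io/v1), and the object names are placeholders taken from output elsewhere in this bug.

```
// Hypothetical diagnostic: fetch the Machine and the CSR and print how far
// apart their API-server-assigned creation times are. Object names are
// placeholders; error handling is kept minimal.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}

	// Machines are a custom resource, so use the dynamic client for them.
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	machineGVR := schema.GroupVersionResource{
		Group: "machine.openshift.io", Version: "v1beta1", Resource: "machines",
	}
	machine, err := dyn.Resource(machineGVR).Namespace("openshift-machine-api").
		Get(context.TODO(), "jhou1-99kqt-worker-ap-northeast-1a-stl9s", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// CSRs are a built-in API type.
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	csr, err := cs.CertificatesV1().CertificateSigningRequests().
		Get(context.TODO(), "csr-9tkc6", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// If this gap exceeds the approver's window (roughly ten minutes in the
	// log above), the CSR is rejected even with perfectly synced clocks.
	gap := csr.CreationTimestamp.Time.Sub(machine.GetCreationTimestamp().Time)
	fmt.Printf("Machine -> CSR creation gap: %s\n", gap)
}
```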
While I think an eventual fix for this BZ might involve MCO improvements, today the primary responsibility is in https://github.com/openshift/cluster-machine-approver which I think is the Auth component. Please reassign if I'm wrong.
This should have been fixed in master by https://github.com/openshift/cluster-machine-approver/pull/37

Additionally, this has impacted the workflow as well:
https://github.com/openshift/cluster-machine-approver/pull/43
github.com/openshift/cluster-machine-approver/pull/41
Verified in 4.3.0-0.nightly-2019-11-13-233341

The m5.metal instance type is supported, and the CSR can be approved.

```
oc get machines
NAME                                       PHASE     TYPE        REGION           ZONE              AGE
jhou1-99kqt-master-0                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1a   125m
jhou1-99kqt-master-1                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1c   125m
jhou1-99kqt-master-2                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1d   125m
jhou1-99kqt-worker-ap-northeast-1a-stl9s   Running   m5.metal    ap-northeast-1   ap-northeast-1a   61m
jhou1-99kqt-worker-ap-northeast-1c-84m6l   Running   m4.large    ap-northeast-1   ap-northeast-1c   121m
jhou1-99kqt-worker-ap-northeast-1d-f9glj   Running   m4.large    ap-northeast-1   ap-northeast-1d   121m

oc describe node ip-10-0-136-76.ap-northeast-1.compute.internal
Name:               ip-10-0-136-76.ap-northeast-1.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.metal
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ap-northeast-1
                    failure-domain.beta.kubernetes.io/zone=ap-northeast-1a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-136-76
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machine.openshift.io/machine: openshift-machine-api/jhou1-99kqt-worker-ap-northeast-1a-stl9s
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-1f7e7fed3eb97273ed5f60cdff96b22e
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-1f7e7fed3eb97273ed5f60cdff96b22e
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062