Bug 1723955 - Instances of type m5.metal are not automatically approved by the node auto-approver
Summary: Instances of type m5.metal are not automatically approved by the node auto-approver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.z
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-25 20:38 UTC by Cesar Wong
Modified: 2020-04-08 16:33 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:04:15 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:04:40 UTC)

Description Cesar Wong 2019-06-25 20:38:31 UTC
Description of problem:
EC2 instances of type m5.metal are not automatically approved by the node auto-approver.

Version-Release number of selected component (if applicable):
4.1.2

How reproducible:
Always

Steps to Reproduce:
1. Install a new cluster
2. Add a new machineset with instance type m5.metal to the cluster (see the sketch after this list)
3. Wait for nodes to join
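
One way to carry out step 2, sketched with oc; the machineset and file names are placeholders, and the instanceType field path assumes the AWS provider spec:

```
# List the existing machinesets in the cluster
oc get machinesets -n openshift-machine-api

# Export an existing worker machineset as a starting point
oc get machineset <existing-worker-machineset> -n openshift-machine-api -o yaml > metal-machineset.yaml

# Edit metal-machineset.yaml: give it a new metadata.name and set
# spec.template.spec.providerSpec.value.instanceType to m5.metal, then:
oc create -f metal-machineset.yaml
```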

Actual results:
New instances never join the cluster

Expected results:
A new node is added to the cluster

Additional info:
Manually approving the CSR allows the new node to join.
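
For reference, the manual approval can be done with oc; the CSR name below is the one from comment 1 and is illustrative:

```
# List CSRs; a node stuck joining will have one in Pending state
oc get csr

# Approve the pending CSR by name
oc adm certificate approve csr-9tkc6
```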

Comment 1 Matthew Staebler 2019-06-25 21:39:17 UTC
The following is the pertinent output from the logs of cluster-machine-approver.

I0625 18:33:27.288077       1 main.go:132] CSR csr-9tkc6 not authorized: CSR csr-9tkc6 creation time 2019-06-25 18:33:27 +0000 UTC not in range (2019-06-25 18:14:44 +0000 UTC, 2019-06-25 18:24:54 +0000 UTC)

Comment 2 Cesar Wong 2019-06-26 16:02:33 UTC
Reassigning to the RHCOS team. We should ensure we are syncing the machine's time.

Comment 3 Colin Walters 2019-06-26 16:52:51 UTC
Hmm.  I think we need to wait for time synchronization to occur in the initial pivot, and potentially in the second boot.

See also https://github.com/openshift/machine-config-operator/issues/629
In this case, it would make sense to me for the workers to ask the masters for time sync even before they've joined the cluster, so the CSR is valid, and also after.
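
A quick way to check whether a node's clock is actually synchronized, assuming chronyd (which RHCOS runs):

```
# On the affected node: show the offset from the reference clock and leap status
chronyc tracking
```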

Comment 4 Colin Walters 2019-06-26 17:20:02 UTC
Cesar, does this reproduce 100% of the time (pun intended)?

Comment 5 Cesar Wong 2019-06-26 17:57:55 UTC
So far, from what we've seen, yes; at least with a few cluster installs (us-east-1 and us-west-1, I believe). No instance of type m5.metal has been able to join the cluster. This is only in the past few days, though, so it may be something that AWS resolves eventually.

Comment 6 Steve Milner 2019-06-26 18:07:21 UTC
Updating to 4.1.z since this was found in 4.1.2.

Comment 7 Colin Walters 2019-07-01 18:42:33 UTC
One important thing here: on !metal nodes on AWS, we use a paravirtualized clock source very early in boot. You'll see this in dmesg:

```
[    0.960025] clocksource: Switched to clocksource xen
```

Similarly on e.g. KVM (booting via cosa run) I see:

```
[    0.564183] clocksource: Switched to clocksource kvm-clock
```

But this isn't true on metal.
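
The active clock source can be read from sysfs; on a metal instance this typically reports tsc rather than a paravirtualized source:

```
# Show which clock source the kernel is currently using
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
```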

Comment 8 Matthew Staebler 2019-07-03 13:18:14 UTC
I am not sure that this is an issue with clock skew. The timestamps being compared are the creation times on the Machine resource and the CSR. Both of those timestamps should be set by the API server, not by the machine being built. Is it possible that it is taking more than 10 minutes between when the Machine resource is created and when the CSR is created?
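
Both timestamps can be read straight from the API server to check this; the resource names below are taken from elsewhere in this bug and are purely illustrative:

```
# Creation time of the Machine resource
oc get machine jhou1-99kqt-worker-ap-northeast-1a-stl9s -n openshift-machine-api \
  -o jsonpath='{.metadata.creationTimestamp}'

# Creation time of the node's CSR
oc get csr csr-9tkc6 -o jsonpath='{.metadata.creationTimestamp}'
```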

Comment 9 Colin Walters 2019-07-04 13:21:04 UTC
>  Both of those timestamps should be set by the API server, not by the machine being built

Ah, OK; I jumped to a conclusion about node time skew.

> Is it possible that it is taking more than 10 minutes between when the Machine resource is created and when the CSR is created?

For metal instances, it also seems highly likely to me that provisioning the machine takes much longer than it does for VMs. I bet that metal instances are basically always occupied by "spot" work, and provisioning them needs to wait for the spot instances to be evicted.

Comment 11 Colin Walters 2019-08-29 16:28:25 UTC
While I think an eventual fix for this BZ might involve MCO improvements, today the primary responsibility is in
https://github.com/openshift/cluster-machine-approver
which I think is the Auth component.  Please reassign if I'm wrong.

Comment 14 Alberto 2019-11-08 10:29:31 UTC
This should have been fixed in master by https://github.com/openshift/cluster-machine-approver/pull/37
This has impacted the workflow as well:
https://github.com/openshift/cluster-machine-approver/pull/43
https://github.com/openshift/cluster-machine-approver/pull/41

Comment 16 Jianwei Hou 2019-11-15 09:38:01 UTC
Verified in 4.3.0-0.nightly-2019-11-13-233341

The m5.metal instance type is supported, and the CSR can be approved.

oc get machines
NAME                                       PHASE     TYPE        REGION           ZONE              AGE
jhou1-99kqt-master-0                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1a   125m
jhou1-99kqt-master-1                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1c   125m
jhou1-99kqt-master-2                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1d   125m
jhou1-99kqt-worker-ap-northeast-1a-stl9s   Running   m5.metal    ap-northeast-1   ap-northeast-1a   61m
jhou1-99kqt-worker-ap-northeast-1c-84m6l   Running   m4.large    ap-northeast-1   ap-northeast-1c   121m
jhou1-99kqt-worker-ap-northeast-1d-f9glj   Running   m4.large    ap-northeast-1   ap-northeast-1d   121m

oc describe node ip-10-0-136-76.ap-northeast-1.compute.internal
Name:               ip-10-0-136-76.ap-northeast-1.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.metal
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ap-northeast-1
                    failure-domain.beta.kubernetes.io/zone=ap-northeast-1a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-136-76
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machine.openshift.io/machine: openshift-machine-api/jhou1-99kqt-worker-ap-northeast-1a-stl9s
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-1f7e7fed3eb97273ed5f60cdff96b22e
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-1f7e7fed3eb97273ed5f60cdff96b22e
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true

Comment 18 errata-xmlrpc 2020-01-23 11:04:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

