Bug 1723955
| Summary: | Instances of type metal.m5 are not automatically approved by the node auto-approver | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Cesar Wong <cewong> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.z | CC: | aos-bugs, bbreard, dustymabe, imcleod, jligon, mfojtik, mstaeble, nobody+zeenix, nstielau, sdodson, tparikh, walters, wmeng, zeenix |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-23 11:04:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Cesar Wong
2019-06-25 20:38:31 UTC
The following is the pertinent output from the logs of cluster-machine-approver:

```
I0625 18:33:27.288077 1 main.go:132] CSR csr-9tkc6 not authorized: CSR csr-9tkc6 creation time 2019-06-25 18:33:27 +0000 UTC not in range (2019-06-25 18:14:44 +0000 UTC, 2019-06-25 18:24:54 +0000 UTC)
```

Reassigning to the RHCOS team. We should ensure we are syncing the machine's time.

Hmm. I think we need to wait for time synchronization to occur in the initial pivot, and potentially in the second boot. See also https://github.com/openshift/machine-config-operator/issues/629

In this case, it would make sense to me for the workers to ask the masters for time sync even before they've joined the cluster, so that the CSR is valid, but also after.

Cesar, does this reproduce 100% of the time (pun intended)?

So far from what we've seen, yes. At least with a few cluster installs (us-east-1 and us-west-1, I believe). No instance with type metal.m5 has been able to join the cluster. This is only in the past few days though, so it may be something that AWS resolves eventually.

Updating to 4.1.z since this was found in 4.1.2.

One important thing here: on non-metal nodes on AWS, we use a paravirtualized clock source very early in boot. You'll see this in dmesg:

```
[    0.960025] clocksource: Switched to clocksource xen
```

Similarly, on e.g. KVM (booting via cosa run) I see:

```
[    0.564183] clocksource: Switched to clocksource kvm-clock
```

But this isn't true on metal.

I am not sure that this is an issue with clock skew. The timestamps being compared are the creation times on the Machine resource and the CSR. Both of those timestamps should be set by the API server, not by the machine being built. Is it possible that it is taking more than 10 minutes between when the Machine resource is created and when the CSR is created?

> Both of those timestamps should be set by the API server, not by the machine being built

Ah, OK; I jumped to the conclusion around node time skew.
> I am not sure that this is an issue with clock skew. The timestamps being compared are the creation times on the Machine resource and the CSR. Both of those timestamps should be set by the API server, not by the machine being built. Is it possible that it is taking more than 10 minutes between when the Machine resource is created and when the CSR is created?

For metal instances it indeed seems highly likely to me that provisioning the machine takes much longer than for VMs. I bet that metal instances are basically always occupied by "spot" work, and provisioning them needs to wait for the spot instances to be evicted.

While I think an eventual fix for this BZ might involve MCO improvements, today the primary responsibility is in https://github.com/openshift/cluster-machine-approver, which I think is the Auth component. Please reassign if I'm wrong.

This should have been fixed in master by https://github.com/openshift/cluster-machine-approver/pull/37

Additionally, this has impacted the workflow as well:
https://github.com/openshift/cluster-machine-approver/pull/43
https://github.com/openshift/cluster-machine-approver/pull/41

Verified in 4.3.0-0.nightly-2019-11-13-233341.
The m5.metal instance type is supported, and its CSR can be approved:
```
oc get machines
NAME                                       PHASE     TYPE        REGION           ZONE              AGE
jhou1-99kqt-master-0                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1a   125m
jhou1-99kqt-master-1                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1c   125m
jhou1-99kqt-master-2                       Running   m4.xlarge   ap-northeast-1   ap-northeast-1d   125m
jhou1-99kqt-worker-ap-northeast-1a-stl9s   Running   m5.metal    ap-northeast-1   ap-northeast-1a   61m
jhou1-99kqt-worker-ap-northeast-1c-84m6l   Running   m4.large    ap-northeast-1   ap-northeast-1c   121m
jhou1-99kqt-worker-ap-northeast-1d-f9glj   Running   m4.large    ap-northeast-1   ap-northeast-1d   121m
```
```
oc describe node ip-10-0-136-76.ap-northeast-1.compute.internal
Name:               ip-10-0-136-76.ap-northeast-1.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.metal
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ap-northeast-1
                    failure-domain.beta.kubernetes.io/zone=ap-northeast-1a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-136-76
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machine.openshift.io/machine: openshift-machine-api/jhou1-99kqt-worker-ap-northeast-1a-stl9s
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-1f7e7fed3eb97273ed5f60cdff96b22e
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-1f7e7fed3eb97273ed5f60cdff96b22e
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062