Bug 1965034 - Ignition fails to get machine config for worker nodes
Summary: Ignition fails to get machine config for worker nodes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: aos-install
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-05-26 15:24 UTC by Arjun Naik
Modified: 2023-09-15 01:07 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-21 17:32:58 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Arjun Naik 2021-05-26 15:24:16 UTC
Description of problem:
The worker nodes fails to start correctly. The following is seen in the machine-config-operator logs

I0520 16:42:29.319204       1 controller.go:168] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: reconciling Machine
I0520 16:42:29.319233       1 actuator.go:104] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: actuator checking if machine exists
I0520 16:42:29.397262       1 reconciler.go:414] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: Found instance by id: i-04c681c46c9397220
I0520 16:42:29.397289       1 controller.go:276] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: reconciling machine triggers idempotent update
I0520 16:42:29.397294       1 actuator.go:120] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: actuator updating machine
I0520 16:42:29.398096       1 reconciler.go:155] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: updating machine
I0520 16:42:29.493481       1 reconciler.go:414] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: Found instance by id: i-04c681c46c9397220
I0520 16:42:29.493507       1 reconciler.go:197] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: found 1 running instances for machine
I0520 16:42:29.493518       1 reconciler.go:338] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: ProviderID already set in the machine Spec with value:aws:///us-east-1a/i-04c681c46c9397220
I0520 16:42:29.493888       1 reconciler.go:227] Updated machine james2-sts-0520-7ztw4-worker-us-east-1a-d8drn
I0520 16:42:29.493902       1 machine_scope.go:140] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: Updating status
I0520 16:42:29.573176       1 machine_scope.go:167] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: finished calculating AWS status
I0520 16:42:29.573195       1 machine_scope.go:86] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: patching machine
I0520 16:42:29.591227       1 controller.go:292] james2-sts-0520-7ztw4-worker-us-east-1a-d8drn: has no node yet, requeuing
I0520 16:42:29.591290       1 controller.go:168] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: reconciling Machine
I0520 16:42:29.591300       1 actuator.go:104] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: actuator checking if machine exists
I0520 16:42:29.694426       1 reconciler.go:414] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: Found instance by id: i-066c6fc306e69ff29
I0520 16:42:29.694448       1 controller.go:276] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: reconciling machine triggers idempotent update
I0520 16:42:29.694453       1 actuator.go:120] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: actuator updating machine
I0520 16:42:29.695269       1 reconciler.go:155] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: updating machine
I0520 16:42:29.764509       1 reconciler.go:414] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: Found instance by id: i-066c6fc306e69ff29
I0520 16:42:29.764531       1 reconciler.go:197] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: found 1 running instances for machine
I0520 16:42:29.764540       1 reconciler.go:338] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: ProviderID already set in the machine Spec with value:aws:///us-east-1a/i-066c6fc306e69ff29
I0520 16:42:29.764895       1 reconciler.go:227] Updated machine james2-sts-0520-7ztw4-worker-us-east-1a-wzltl
I0520 16:42:29.764907       1 machine_scope.go:140] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: Updating status
I0520 16:42:30.004769       1 machine_scope.go:167] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: finished calculating AWS status
I0520 16:42:30.004790       1 machine_scope.go:86] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: patching machine
I0520 16:42:30.021480       1 controller.go:292] james2-sts-0520-7ztw4-worker-us-east-1a-wzltl: has no node yet, requeuing

The node in question is completely unreachable because ignition has failed. Since this was an AWS instance I could get the system log which contained:

[K[[0;1;31m*[0m[0;31m*    [0m] A start job is running for Ignition (fetch) (5min 27s / no limit)
[K[[0;31m*[0;1;31m*[0m[0;31m*   [0m] A start job is running for Ignition (fetch) (5min 27s / no limit)
[K[ [0;31m*[0;1;31m*[0m[0;31m*  [0m] A start job is running for Ignition (fetch) (5min 28s / no limit)
[K[  [0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (5min 28s / no limit)
[K[   [0;31m*[0;1;31m*[0m[0;31m*[0m] A start job is running for Ignition (fetch) (5min 29s / no limit)
[K[    [0;31m*[0;1;31m*[0m] A start job is running for Ignition (fetch) (5min 29s / no limit)
[K[     [0;31m*[0m] A start job is running for Ignition (fetch) (5min 30s / no limit)
[K[    [0;31m*[0;1;31m*[0m] A start job is running for Ignition (fetch) (5min 30s / no limit)
[K[   [0;31m*[0;1;31m*[0m[0;31m*[0m] A start job is running for Ignition (fetch) (5min 31s / no limit)
[K[  [0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is runn[2021-05-20T14:00:31.910918]ing for Ignition (fetch) (5min 31s / no limit)[  336.240377] ignition[701]: INFO     : GET https://api-int.james2-sts-0520.mywk.s1.devshift.org:22623/config/worker: attempt #70
[  336.252779] ignition[701]: INFO     : GET result: Internal Server Error
[K[ [0;31m*[0;1;31m*[0m[0;31m*  [0m] A start job is running for Ignition (fetch) (5min 32s / no limit)
[K[[0;31m*[0;1;31m*[0m[0;31m*   [0m] A start job is running for Ignition (fetch) (5min 32s / no limit)
[K[[0;1;31m*[0m[0;31m*    [0m] A start job is running for Ignition (fetch) (5min 33s / no limit)
[K[[0m[0;31m*     [0m] A start job is running for Ignition (fetch) (5min 33s / no limit)
[K[[0;1;31m*[0m[0;31m*    [0m] A start job is running for Ignition (fetch) (5min 34s / no limit)
[K[[0;31m*[0;1;31m*[0m[0;31m*   [0m] A start job is running for Ignition (fetch) (5min 34s / no limit)
[K[ [0;31m*[0;1;31m*[0m[0;31m*  [0m] A start job is running for Ignition (fetch) (5min 35s / no limit)
[K[  [0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (5min 35s / no limit)
[K[   [0;31m*[0;1;31m*[0m[0;31m*[0m] A start job is running for Ignition (fetch) (5min 36s / no limit)
[K[    [0;31m*[0;1;31m*[0m] A start job[2021-05-20T14:00:36.910933] is running for Ignition (fetch) (5min 36s / no limit)[  341.241434] ignition[701]: INFO     : GET https://api-int.james2-sts-0520.mywk.s1.devshift.org:22623/config/worker: attempt #71
[  341.252631] ignition[701]: INFO     : GET result: Internal Server Error
[K[     [0;31m*[0m] A start job is running for Ignition (fetch) (5min 37s / no limit)
[K[    [0;31m*[0;1;31m*[0m] A start job is running for Ignition (fetch) (5min 37s / no limit)
[K[   [0;31m*[0;1;31m*[0m[0;31m*[0m] A start job is running for Ignition (fetch) (5min 38s / no limit)
[K[  [0;31m*[0;1;31m*[0m[0;31m* [0m] A start job is running for Ignition (fetch) (5min 38s / no limit)
[K[ [0;31m*[0;1;31m*[0m[0;31m*  [0m] A start job is running for Ignition (fetch) (5min 39s / no limit)
[K[[0;31m*[0;1;31m*[0m[0;31m*   [0m] A start job is running for Ignition (fetch) (5min 39s / no limit)
[K[[0;1;31m*[0m[0;31m*    [0m] A start job is running for Ignition (fetch) (5min 40s / no limit)
[K[[0m[0;31m*     [0m] A start job is running for Ignition (fetch) (5min 40s / no limit)
[K[[0;1;31m*[0m[0;31m*    [0m] A start job is running for Ignition (fetch) (5min 41s / no limit)
[K[[0;31m*[0;1;31m*[0m[0;31m*   [0m] A start job is running for Igni[2021-05-20T14:00:41.910944]tion (fetch) (5min 41s / no limit)[  346.242398] ignition[701]: INFO     : GET https://api-int.james2-sts-0520.mywk.s1.devshift.org:22623/config/worker: attempt #72
[  346.253541] ignition[701]: INFO     : GET result: Internal Server Error

curling the endpoint https://api-int.james2-sts-0520.mywk.s1.devshift.org:22623/config/worker returns a response successfully.

There are also no error messages in the machine-config-server logs

[core@ip-10-0-211-18 ~]$ oc logs -n openshift-machine-config-operator machine-config-server-zbvqp
I0520 13:54:03.781041       1 start.go:38] Version: v4.7.0-202105111858.p0-dirty (e3863b02b7403342cdf0f981889e8c3cfc2d86bb)
I0520 13:54:03.784420       1 api.go:72] Launching server on :22624
I0520 13:54:03.784425       1 api.go:72] Launching server on :22623
I0520 16:53:19.128569       1 api.go:117] Pool worker requested by address:"10.0.240.111:16856" User-Agent:"curl/7.68.0" Accept-Header: "*/*"
[core@ip-10-0-211-18 ~]$ oc logs -n openshift-machine-config-operator machine-config-server-zz6ff
I0520 13:54:02.307965       1 start.go:38] Version: v4.7.0-202105111858.p0-dirty (e3863b02b7403342cdf0f981889e8c3cfc2d86bb)
I0520 13:54:02.312823       1 api.go:72] Launching server on :22624
I0520 13:54:02.312825       1 api.go:72] Launching server on :22623
[core@ip-10-0-211-18 ~]$ oc logs -n openshift-machine-config-operator machine-config-server-54kw8
I0520 13:54:02.038296       1 start.go:38] Version: v4.7.0-202105111858.p0-dirty (e3863b02b7403342cdf0f981889e8c3cfc2d86bb)
I0520 13:54:02.041284       1 api.go:72] Launching server on :22624
I0520 13:54:02.041290       1 api.go:72] Launching server on :22623


Version-Release number of selected component (if applicable):


How reproducible:

I am not certain if this completely related but this cluster was configured to use STS credentials where the service account token is used to authenticate with AWS and fetch AWS credentials for interacting with the API.


Expected results:


Additional info:

Comment 1 Yu Qi Zhang 2021-05-26 18:23:21 UTC
Based on the machine-config-server logs, none of the logs had any requests from worker nodes (I think the one line of "Pool worker requested by address:" was your manual curl). This indicates that the worker nodes were unable to reach the MCS endpoint. The console logs also seem to point to this, which could mean a networking issue of some sort.

I'm not very familiar with STS credentials, so maybe we need some other eyes on this. Moving to the CoreOS team to see if they have more insight from Ignition's perspective

Comment 2 slowrie 2021-05-26 18:42:58 UTC
From the logs posted Ignition is receiving a 500 Internal Server Error when querying for the config from the MCS and is correctly retrying. Is there an actual hard-stop error that occurs or is it a timeout with this loop occuring?

I'm quite surprised that there would be no logs in the MCS when it's consistently returning a 500 Internal Server Error, is there a case where that can occur @jerzhang?

Comment 3 Yu Qi Zhang 2021-05-26 18:53:12 UTC
The MCS does have situations where it can return a StatusInternalServerError, but it would also be accompanied by error logs in the MCS. At the very least it would have logged the request before throwing the StatusInternalServerError, which made me suspect that the node wasn't able to actual reach the server. I'm not aware of another situation that the MCS would return StatusInternalServerError outside of those scenarios

Comment 4 Greg Sheremeta 2021-05-26 20:18:27 UTC
STS is an alternative new style of OCP installation. Instead of supplying AWS access key and secret access key and storing them in the cluster, Customer passes in role names to use.

It's tech preview in 4.7:
https://docs.openshift.com/container-platform/4.7/authentication/managing_cloud_provider_credentials/cco-mode-sts.html

sending in 4 STS IAM roles works. We see the problem when sending in 5. Any chance there's some truncation happening somewhere?

Comment 6 slowrie 2021-06-01 14:19:37 UTC
Can you please include full logs & a sanitized version of the Ignition config?

Comment 13 Red Hat Bugzilla 2023-09-15 01:07:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days


Note You need to log in before you can comment on or make changes to this bug.