Bug 1947684 - MCO on SNO sometimes has rendered configs and sometimes does not
Summary: MCO on SNO sometimes has rendered configs and sometimes does not
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ryan Phillips
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1951009
Depends On:
Blocks: dit
 
Reported: 2021-04-08 21:44 UTC by Ryan Phillips
Modified: 2021-07-27 22:58 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:58:16 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
GitHub openshift/machine-config-operator pull 2517 (open): Bug 1947684: delay kubelet config readiness until after pools and controller config are ready (last updated 2021-04-09 13:22:49 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:58:32 UTC)

Description Ryan Phillips 2021-04-08 21:44:26 UTC
Description of problem:

Using a bootstrap node to create a SNO cluster [4.8.0-0.nightly-2021-04-08-124424], I am seeing the following issues:

1. The MCD is running twice on the single node.
2. On an install named A, there was a rendered config in the 'worker' pool but nothing in the 'master' pool. Custom configurations (KubeletConfigs, etc.) could not create a new machine config because the pool had no rendered configs.
3. On a subsequent install named B, both the master and worker pools were missing rendered configs. [1]

1. https://gist.githubusercontent.com/rphillips/ebd65549fb7ea446906ad9d29d49f6c1/raw/9fdad6d0f4572714f950a6ac0e4175bb5f5366dc/gistfile1.txt
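
For reference, a minimal set of checks for these symptoms (illustrative commands, not output from the failed installs; resource names are the standard MCO ones):

# oc -n openshift-machine-config-operator get pods    # on SNO there should be exactly one machine-config-daemon pod
# oc get machineconfigpool                            # CONFIG should name a rendered-master-*/rendered-worker-* config
# oc get machineconfig | grep '^rendered-'            # the rendered configs themselves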

Version-Release number of selected component (if applicable):


How reproducible:
The behavior seems to differ across three separate installs.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yu Qi Zhang 2021-04-08 23:09:12 UTC
So just installing a regular SNO cluster with your reported version, I can see:

[root@yzhang tmp]# oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-128-18.us-west-2.compute.internal   Ready    master,worker   34m   v1.20.0+5f82cdb
[root@yzhang tmp]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-722813928e4dc89c4278d6e66ccb1488   True      False      False      1              1                   1                     0                      33m
worker   rendered-worker-349538575201cc727fd5db23354ff784   True      False      False      0              0                   0                     0                      33m
[root@yzhang tmp]# mcopods
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-5954c58ff6-bprps   1/1     Running   2          31m
machine-config-daemon-264bp                  2/2     Running   0          33m
machine-config-operator-776c745668-zg7xx     1/1     Running   2          38m
machine-config-server-wwvgg                  1/1     Running   0          31m

So in regular operation, there should be:

1. one MCD running
2. rendered configs for both master and worker
3. an empty worker pool (zero machines)

It sounds to me like the kubeletconfig likely failed to generate, which in turn caused the MCC to fail to generate the rendered config. Do you have any logs from the failed runs?
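
For what it's worth, a minimal sketch of the logs and objects that would help here, assuming the standard MCO resource and container names shown in the pod listing above:

# oc -n openshift-machine-config-operator logs deployment/machine-config-controller                      # render errors, if any
# oc -n openshift-machine-config-operator logs daemonset/machine-config-daemon -c machine-config-daemon
# oc get kubeletconfig -o yaml                                                                            # status conditions should show whether rendering succeeded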

Comment 2 Yu Qi Zhang 2021-04-09 00:22:12 UTC
I tried applying the kubeletconfig manifest as a pre-install manifest and was able to reproduce:

[root@yzhang 04-08]# oc get nodes
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-153-126.us-west-1.compute.internal   Ready    master,worker   43m   v1.20.0+5f82cdb
[root@yzhang 04-08]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       1              0                   0                     1                      41m
worker   rendered-worker-1eecc621170fe6d6fa350361bef8b00e   True      False      False      0              0                   0                     0                      41m
[root@yzhang 04-08]# mcopods
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-5954c58ff6-k95bz   1/1     Running   3          39m
machine-config-daemon-lmpqr                  2/2     Running   0          41m
machine-config-operator-66c4ddf8df-n6njp     1/1     Running   3          49m
machine-config-server-qk4pk                  1/1     Running   0          39m

The master pool reports no config, but the rendered config does indeed get generated:

# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
00-worker                                          ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
01-master-container-runtime                        ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
01-master-kubelet                                  ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
01-worker-container-runtime                        ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
01-worker-kubelet                                  ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
99-master-generated-kubelet                        ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
99-master-generated-registries                     ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
99-master-ssh                                                                                 3.2.0             52m
99-worker-generated-registries                     ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
99-worker-ssh                                                                                 3.2.0             52m
rendered-master-4d4cda0f8a9e404f0e3d170f9e48201f   ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m
rendered-worker-1eecc621170fe6d6fa350361bef8b00e   ca8fcb887f153604556f2706cee8a33165879797   3.2.0             42m

The pool just reports no config because none of its nodes are on any config for it to report. This is due to the kubeletconfig failing to be rendered during bootstrap. In the MCD logs I see:

I0409 00:14:22.993466    9851 daemon.go:769] In bootstrap mode
E0409 00:14:22.993573    9851 writer.go:135] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-1551420a8a62d1a81f239737550b6cfd" not found

So this seems more in line with your first try.
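
(For reference, one way to see which config a node itself is reporting is to read the MCD's machineconfiguration.openshift.io/currentConfig and desiredConfig node annotations; <node-name> below is a placeholder:)

# oc get node <node-name> -o yaml | grep 'machineconfiguration.openshift.io/'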

I still believe the correct way to solve this is to have the kubeletconfig controller handle bootstrap manifests. We have this for the containerruntime controller but not for the kubelet controller (see https://github.com/openshift/machine-config-operator/pull/1866).
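
For context, a pre-install (bootstrap) KubeletConfig manifest of the kind applied in the reproduction above would look roughly like this. This is a sketch only: the name and maxPods value are illustrative, and INSTALL_DIR is assumed to be the directory prepared with 'openshift-install create manifests', not the exact manifest or path used here.

# cat <<'EOF' > "${INSTALL_DIR}/manifests/99-master-kubelet-config.yaml"
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: example-master-kubelet-config   # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    maxPods: 500                         # illustrative value
EOF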

Moving over to the Node team to take a look.

Comment 5 MinLi 2021-04-19 07:10:38 UTC
Verified on version: 4.8.0-0.nightly-2021-04-18-101412

# oc get node 
NAME      STATUS   ROLES           AGE     VERSION
sno-0-0   Ready    master,worker   3h45m   v1.21.0-rc.0+2993be8

# oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-768ee516526793d07d6875602ae3e2c6   True      False      False      1              1                   1                     0                      3h43m
worker   rendered-worker-c3a3ddff445e0a8fcecd38e7225b3d85   True      False      False      0              0                   0                     0                      3h43m

# oc get pod -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-8676847cf8-8wzhd   1/1     Running   4          4h5m
machine-config-daemon-7tq92                  2/2     Running   0          4h7m
machine-config-operator-84bd546ffc-48qsk     1/1     Running   5          4h18m
machine-config-server-hdqjp                  1/1     Running   0          3h54m

Comment 6 Yu Qi Zhang 2021-04-21 18:35:48 UTC
*** Bug 1951009 has been marked as a duplicate of this bug. ***

Comment 7 Pablo Iranzo Gómez 2021-04-22 09:32:36 UTC
Can this also be fixed for 4.7?

Comment 10 errata-xmlrpc 2021-07-27 22:58:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

