1899750 – Race condition with ImageContentSources sometimes prevents cluster installing

Summary: Race condition with ImageContentSources sometimes prevents cluster installing

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Yu Qi Zhang
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1874818 1945863
Depends On:
Blocks:

Reported:	2020-11-19 22:04 UTC by Jim Minter
Modified:	2021-10-25 15:54 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-25 15:54:07 UTC
Target Upstream Version:
Embargoed:

Description Jim Minter 2020-11-19 22:04:20 UTC

ARO uses the OCP offline registry feature (ImageContentSources), configuring these in the InstallConfig struct before running the installer in-process.

We're occasionally (roughly 5% of installs?) seeing the following race condition: on install, when worker nodes are created, they sometimes fetch their Ignition config literally seconds before that config is populated with the image content sources.  The workers then try to pull images from quay.io, fail, and the installation fails.
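
A minimal sketch of why a worker without the image content sources ends up at quay.io. It assumes mirrors are tried in order before falling back to the source repository, and it handles digest references only; the mirror host, the digest, and the helper name `pull_candidates` are made up for illustration and are not taken from the installer or the container runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageContentSource:
    source: str     # repository prefix that release images reference
    mirrors: tuple  # mirror repositories to try before the source


def pull_candidates(image, sources):
    """Return the repositories to try, in order, for a digest reference."""
    repo, sep, ref = image.partition("@")
    for ics in sources:
        if repo == ics.source or repo.startswith(ics.source + "/"):
            suffix = repo[len(ics.source):]
            mirrored = [m + suffix + sep + ref for m in ics.mirrors]
            # Assumption: the original location stays as the last resort.
            return mirrored + [image]
    # No matching source configured: only the original location is tried.
    return [image]


# Hypothetical mirror and placeholder digest.
sources = [ImageContentSource(
    source="quay.io/openshift-release-dev/ocp-v4.0-art-dev",
    mirrors=("mirror.example.com/ocp/art-dev",),
)]
image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:abc"

with_ics = pull_candidates(image, sources)
without_ics = pull_candidates(image, [])
print(with_ics)
print(without_ics)
```

A node that fetched its config before the sources were rendered behaves like the `without_ics` case: quay.io is the only candidate, which fails when quay.io is unreachable.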

ARO is currently carrying a patch (https://github.com/jim-minter/installer/commit/dd520d7ff180d9dbf1a0c60ed5520b23c6296240) which prepopulates a machineconfig containing the image content sources as part of bootstrap to work around this.

It's a fiddle to find a cluster where this problem occurs since it happens a lot in CI, and I don't currently have a cluster or logs to hand.  I investigated a cluster last week after the install had failed and saw that there were *two* rendered-worker machineconfigs, the first laying down a /etc/containers/registries.conf without the ImageContentSources configuration, and the second one with.  I verified using timestamps that the failed nodes had pulled the first one.
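
The timestamp check described above can be sketched roughly as follows. All names and times here are invented for illustration (no logs from an affected cluster are available), and the sketch assumes a node receives whichever rendered config was newest at the moment it fetched.

```python
from datetime import datetime, timezone


def config_served_at(fetch_time, rendered):
    """Return the newest rendered config created at or before fetch_time."""
    eligible = [(created, name) for name, created in rendered.items()
                if created <= fetch_time]
    if not eligible:
        return None
    return max(eligible)[1]


def ts(second):
    # Illustrative timestamps only; not taken from a real cluster.
    return datetime(2020, 1, 1, 12, 0, second, tzinfo=timezone.utc)


# Hypothetical rendered-worker configs: the first lacks the image content
# sources in registries.conf, the second includes them.
rendered = {
    "rendered-worker-aaaa": ts(0),
    "rendered-worker-bbbb": ts(20),
}

failed_node = config_served_at(ts(15), rendered)
healthy_node = config_served_at(ts(30), rendered)
print(failed_node, healthy_node)
```

In this made-up timeline the failed node fetched 5 seconds before the second config existed, which matches the "literally seconds" window described in the report.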

ARO is seeing this on 4.4 and 4.5.  Currently the latest version we deploy is 4.5.16.

I wouldn't be surprised if this affects anyone who uses ImageContentSources.

Comment 2 W. Trevor King 2020-11-19 22:28:12 UTC

Notes from discussions with Jerry:

Bootstrap machine-config operator/server is involved in bringing up the control-plane nodes, but it does not serve the compute nodes.  So this bug is about the production machine-config operator vs. compute nodes.  We think the event graph is something like:

a  imageContentSources set in the install-config.yaml.
|
b  Installer bootstraps the cluster-version operator [1].
|\
c |  Installer bootstraps the machine-config operator [2], involved in unrelated control-plane bring-up.
| |
d |  Installer launches cluster-bootstrap [3].
| |
e |  Cluster-bootstrap pushes manifests, including the ICSP.
| f  Bootstrap CVO starts pushing manifests.
| |\
| | g  Bootstrap CVO creates the production CVO.
| |\
| h |  CVO (bootstrap or production) creates the production machine-config operator.
|\| |
| i |  Machine-config server pulls the ICSP from the cluster and updates the Ignition configs it serves.
| | |
| | j  CVO (bootstrap or production) creates the production machine-API operator.
| | |
| | k  Production machine-API operator creates the compute nodes.
| | |
| | l  Production machine-API operator creates the compute nodes.
| | |
| | m  Compute nodes pull their Ignition configs from the machine-config server.
| |
n |  Bootstrap complete, installer tears down the bootstrap infrastructure.

When (i) happens before (m), everything is fine.  When (m) happens before (i), compute hangs from missing the required ICSP.
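
A rough way to see the race is to treat the graph above as a DAG and check whether anything orders (i) relative to (m). The edge list below is one reading of the ASCII diagram, not an authoritative model, and the helper names are hypothetical.

```python
from collections import defaultdict

# One interpretation of the event graph; letters match the diagram.
EDGES = [
    ("a", "b"), ("b", "c"), ("b", "f"), ("c", "d"), ("d", "e"),
    ("e", "i"), ("e", "n"), ("f", "g"), ("f", "h"), ("h", "i"),
    ("f", "j"), ("j", "k"), ("k", "l"), ("l", "m"),
]


def reachable(edges, start, goal):
    """Depth-first search: is there a happens-before path start -> goal?"""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return False


# Neither event is forced to precede the other, so either order can occur.
race_possible = (not reachable(EDGES, "i", "m")
                 and not reachable(EDGES, "m", "i"))

# Any fix has to add an ordering edge, directly or indirectly, from i to m.
fixed = reachable(EDGES + [("i", "m")], "i", "m")
print(race_possible, fixed)
```

Under this reading, (i) and (m) sit on independent branches after (f), which is why the outcome depends on timing.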

Possible solutions:

* Manifest install levels [4], to delay the production machine-config operator until after bootstrap-complete.  That would move (h) after (n).  The machine-config server might also need to grow logic around checking all available config resources before beginning to serve Ignition configs, to keep (m) from slipping in between (h) and (i).

* Some sort of note left by the bootstrap machine-config server (c) to tell the production machine-config server what resources to expect, so the production machine-config server could avoid serving Ignition configs between (h) and (i).

* Adding machine-health checks to compute nodes by default, so compute nodes that lose this race and hang get deleted and re-provisioned.  This risks delaying the install to the point that the install-complete wait times out, but MHCs on compute will also be useful for day-2 robustness.

* More ideas?

[1]: https://github.com/openshift/installer/blob/38c1d538439f2b087dfe1fe02abb782319fea840/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L50-L69
[2]: https://github.com/openshift/installer/blob/38c1d538439f2b087dfe1fe02abb782319fea840/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L218-L287
[3]: https://github.com/openshift/installer/blob/38c1d538439f2b087dfe1fe02abb782319fea840/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L334-L345
[4]: https://github.com/openshift/enhancements/pull/477
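
The "note left by the bootstrap machine-config server" option could look roughly like the gate below. This is only a sketch of the idea: the resource names are hypothetical, and nothing here reflects how the machine-config server is actually implemented.

```python
def missing_resources(expected, observed):
    """Resources the bootstrap side announced that are not in the cluster yet."""
    return sorted(set(expected) - set(observed))


def may_serve_ignition(expected, observed):
    """Serve Ignition configs only once every expected resource is present."""
    return not missing_resources(expected, observed)


# Hypothetical note left by the bootstrap side.
expected = ["ImageContentSourcePolicy/image-policy-0"]

before_i = may_serve_ignition(expected, observed=[])
after_i = may_serve_ignition(expected, observed=expected)
print(before_i, after_i)
```

With such a gate, a compute node asking for its config between (h) and (i) would be refused and could retry, instead of receiving a config without the ICSP.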

Comment 3 Yu Qi Zhang 2020-11-30 16:30:52 UTC

I just remembered we had https://bugzilla.redhat.com/show_bug.cgi?id=1874818

So, I'm not quite sure if I understood the diagram correctly. It might be missing some distinctions between the bootstrap and the cluster operation for the MCO, and "Machine-config server pulls the ICSP from the cluster and updates the Ignition configs it serves" is not quite the correct phrasing here. Let me try to start from the top:

The MCO has a bootstrap mode, in which it runs the MCC and MCS. The bootstrap MCS serves only master nodes, so it shouldn't be a problem here. The bootstrap MCC, however, is responsible for creating the initial rendered machineconfigs that get served to both masters and workers.

In the bootstrap MCC, this should be reading and creating the MC: https://github.com/openshift/machine-config-operator/blob/bc53ddcf0f380f2ed2db2d57e33d613328a4f163/pkg/controller/container-runtime-config/container_runtime_config_controller.go#L794

It sounds like in some cases the bootstrap MCC fails to read the cluster ICSP (maybe its logs would show something, maybe not), so it doesn't generate any MCs for it. That might not even be an error if, for some reason, there is no cluster ICSP in the templates dir yet. The cluster then continues the bootstrap process.

When the cluster-level MCO comes up, it goes through the syncs again, including syncing the cluster MCS and MCC (MCS comes first). So it sounds like the ICSP change could have taken effect at any point between the bootstrap MCC finishing and the cluster MCC coming up. Right? There's a caveat here: the cluster MCs you see when you run `oc get mc` should all be generated by the cluster MCC. That means that if the ICSP change took effect in between, the cluster MCC should have generated only the new one: instead of 2 MCs you would get only 1, and when the MCD runs you should get the cryptic "rendered-worker-xxx not found". I've only seen this happen to master nodes in the past, so I'm not 100% sure I'm correct in that assessment.

Basically, what this means is that the actual icsp change that the CVO should have applied to the cluster long ago gets applied after the cluster MCC has started running and synced at least once. That doesn't sound right since other configs that come from templates are all correct, and unless the CVO differentiates how it applies the two, the icsp change should not have been applied that late.

Also, is the ICSP change you're applying worker-only? Do masters also have that change? If so, the other question would be why the master configs get applied correctly when both should go through the same flow.

Comment 4 Brenton Leanhardt 2020-11-30 18:45:30 UTC

We have concerns about this landing in 4.7.  We'd like to discuss designs for 4.8.

Comment 6 Brenton Leanhardt 2021-02-04 18:40:52 UTC

We're still investigating this bug. Stay tuned.

Comment 7 Urvashi Mohnani 2021-03-23 14:51:07 UTC

*** Bug 1874818 has been marked as a duplicate of this bug. ***

Comment 9 Jian Zhang 2021-04-08 02:32:49 UTC

*** Bug 1945863 has been marked as a duplicate of this bug. ***

Comment 10 Patrick Dillon 2021-07-12 15:12:33 UTC

Is there a reason the priority/severity for this bug was upgraded? We plan on fixing this bug but would like to wait until after feature freeze to do so.

Comment 11 Russell Teague 2021-08-02 17:21:11 UTC

Lowering priority until we understand the problem clearly.

Comment 12 Russell Teague 2021-08-24 17:28:52 UTC

Will review again for a future sprint.

Comment 14 Patrick Dillon 2021-08-26 19:15:53 UTC

Can the machine config operator team review this bug? 

From what I can see, there does not appear to be an issue on the installer side. This could also potentially belong to the CVO, but I think we need to rule out an MCO bug first.

https://bugzilla.redhat.com/show_bug.cgi?id=1874818, which has a reproducer, is marked as a duplicate of this BZ.

This original BZ does not have good logs associated with it. Richa, can we get a log bundle from an affected cluster?

Comment 15 W. Trevor King 2021-08-28 23:05:22 UTC

Wait, I thought I'd pinned down the reason in comment 2, with [1] up to keep the CVO from racing with the cluster-update script to get these resources into the cluster.  Is there more work I need to do to demonstrate that as the issue, or sell folks on my proposed mitigation?

[1]: https://github.com/openshift/enhancements/pull/477

Comment 16 Patrick Dillon 2021-08-29 01:10:34 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1899750#c3 raised questions that I thought should be checked out by the MCO, particularly:
* considering the bootstrap MCC renders an initial worker machineconfig along with the master machineconfig, why does the worker lack the ICSP but the master doesn't?
* why are other machineconfigs which depend upon similar template data rendered correctly? That is, the question suggested by this statement: "That doesn't sound right since other configs that come from templates are all correct, and unless the CVO differentiates how it applies the two, the icsp change should not have been applied that late."

I don't have much knowledge of this area beyond what I have read in this BZ. I don't think this belongs to the installer (correct me if I'm wrong), but could be either CVO or MCO. Based on my reading of the BZ, I think the MCO should rule out bugs before this moves to the CVO where the solution would be adding functionality (rather than correcting a bug).