ARO uses the OCP offline registry feature (ImageContentSources), configuring these in the InstallConfig struct before running the installer in-process.

We're occasionally (5% of installs?) seeing the following race condition: on install, when worker nodes are created, they sometimes manage to fetch their ignition config literally seconds before the ignition config is populated with the image content sources. The workers try to pull images from quay.io, fail, and hence installation fails.

ARO is currently carrying a patch (https://github.com/jim-minter/installer/commit/dd520d7ff180d9dbf1a0c60ed5520b23c6296240) which prepopulates a machineconfig containing the image content sources as part of bootstrap to work around this.

It's a fiddle to find a cluster where this problem occurs since it happens a lot in CI, and I don't currently have a cluster or logs to hand. I investigated a cluster last week after the install had failed and saw that there were *two* rendered-worker machineconfigs, the first laying down a /etc/containers/registries.conf without the ImageContentSources configuration, and the second one with it. I verified using timestamps that the failed nodes had pulled the first one.

ARO is seeing this on 4.4 and 4.5. Currently the latest version we deploy is 4.5.16. I wouldn't be surprised if this affects anyone who uses ImageContentSources.
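The timestamp check described above can be sketched roughly as follows. This is only an illustration: the config names and times are made up, the helper `config_served_at` is hypothetical, and it assumes a node is handed the newest rendered-worker machineconfig that exists at the moment it fetches.

```python
from datetime import datetime

# Hypothetical creation times for the two rendered-worker machineconfigs.
rendered = {
    "rendered-worker-without-icsp": datetime(2020, 1, 1, 12, 0, 0),
    "rendered-worker-with-icsp": datetime(2020, 1, 1, 12, 0, 30),
}


def config_served_at(rendered, fetch_time):
    """Return the newest rendered config that existed at fetch_time, or None."""
    existing = [
        (created, name)
        for name, created in rendered.items()
        if created <= fetch_time
    ]
    return max(existing)[1] if existing else None


# A node that fetched its config a few seconds before the second one appeared
# (made-up time) ends up with the config that lacks the image content sources.
early_fetch = datetime(2020, 1, 1, 12, 0, 25)
print(config_served_at(rendered, early_fetch))
```

Comparing each failed node's fetch time against the creation times of the rendered configs in this way is what identifies which of the two configs the node received.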
Notes from discussions with Jerry:

Bootstrap machine-config operator/server is involved in bringing up the control-plane nodes, but it does not serve the compute nodes. So this bug is about the production machine-config operator vs. compute nodes. We think the event graph is something like:

  a  imageContentSources set in the install-config.yaml.
  |
  b  Installer bootstraps the cluster-version operator [1].
  |\
  c |  Installer bootstraps the machine-config operator [2], involved in unrelated control-plane bring-up.
  | |
  d |  Installer launches cluster-bootstrap [3].
  | |
  e |  Cluster-bootstrap pushes manifests, including the ICSP.
  | f  Bootstrap CVO starts pushing manifests.
  | |\
  | | g  Bootstrap CVO creates the production CVO.
  | |\
  | h |  CVO (bootstrap or production) creates the production machine-config operator.
  |\| |
  | i |  Machine-config server pulls the ICSP from the cluster and updates the Ignition configs it serves.
  | | |
  | | j  CVO (bootstrap or production) creates the production machine-API operator.
  | | |
  | | k  Production machine-API operator creates the compute nodes.
  | | |
  | | l  Production machine-API operator creates the compute nodes.
  | | |
  | | m  Compute nodes pull their Ignition configs from the machine-config server.
  | |
  n |  Bootstrap complete, installer tears down the bootstrap infrastructure.

When (i) happens before (m), everything is fine. When (m) happens before (i), compute hangs from missing the required ICSP.

Possible solutions:

* Manifest install levels [4], to delay the production machine-config operator until after bootstrap-complete. That would move (h) after (n). The machine-config server might also need to grow logic around checking all available config resources before beginning to serve Ignition configs, to keep (m) from slipping in between (h) and (i).
* Some sort of note left by the bootstrap machine-config server (c) to tell the production machine-config server what resources to expect, so the production machine-config server could avoid serving Ignition configs between (h) and (i).
* Adding machine-health checks to compute nodes by default, so compute nodes that lose this race and hang get deleted and re-provisioned. This risks delaying the install to the point that the install-complete wait times out, but MHCs on compute will also be useful for day-2 robustness.
* More ideas?

[1]: https://github.com/openshift/installer/blob/38c1d538439f2b087dfe1fe02abb782319fea840/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L50-L69
[2]: https://github.com/openshift/installer/blob/38c1d538439f2b087dfe1fe02abb782319fea840/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L218-L287
[3]: https://github.com/openshift/installer/blob/38c1d538439f2b087dfe1fe02abb782319fea840/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L334-L345
[4]: https://github.com/openshift/enhancements/pull/477
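To make the "note telling the production server what resources to expect" idea concrete, here is a toy model of the gating. It is a sketch only: `MachineConfigServer`, its fields, and its methods are hypothetical names invented for illustration and do not correspond to the real machine-config server code.

```python
from dataclasses import dataclass, field


@dataclass
class MachineConfigServer:
    """Toy stand-in for the production machine-config server (hypothetical)."""

    # Resource kinds the bootstrap side says must exist before serving.
    # An empty set models today's ungated behaviour.
    expected: frozenset = frozenset()
    # Resource kinds the server has observed in the cluster so far.
    observed: set = field(default_factory=set)

    def observe(self, resource: str) -> None:
        """Record that a resource has shown up in the cluster (event i)."""
        self.observed.add(resource)

    def ready(self) -> bool:
        """True once every expected resource has been observed."""
        return self.expected <= self.observed

    def serve_worker_ignition(self) -> dict:
        """Serve a worker config (event m), refusing while resources are missing."""
        if not self.ready():
            missing = sorted(self.expected - self.observed)
            raise RuntimeError(f"not ready, waiting for: {missing}")
        return {"has_icsp": "ImageContentSourcePolicy" in self.observed}


# Ungated: (m) before (i) hands out a config without the ICSP.
ungated = MachineConfigServer()
print(ungated.serve_worker_ignition())

# Gated: (m) before (i) is rejected, so the node has to retry after (i).
gated = MachineConfigServer(expected=frozenset({"ImageContentSourcePolicy"}))
try:
    gated.serve_worker_ignition()
except RuntimeError as err:
    print(err)
gated.observe("ImageContentSourcePolicy")
print(gated.serve_worker_ignition())
```

The sketch assumes a compute node that is refused simply retries its fetch; whether that holds, and how the expected-resource list would actually be communicated, is left open here as in the proposal.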
I just remembered we had https://bugzilla.redhat.com/show_bug.cgi?id=1874818

So, I'm not quite sure if I understood the diagram correctly. It might be missing some distinctions between the bootstrap and the cluster operation for the MCO, and "Machine-config server pulls the ICSP from the cluster and updates the Ignition configs it serves" is not quite the correct phrasing here. Let me try to start from the top:

The MCO has a bootstrap mode in which it runs the MCC and MCS. The bootstrap MCS serves only master nodes, so it shouldn't be a problem here. The bootstrap MCC, however, is responsible for creating the initial rendered machineconfigs that will get served to both masters and workers.

In the bootstrap MCC, this should be reading the ICSP and creating the MC: https://github.com/openshift/machine-config-operator/blob/bc53ddcf0f380f2ed2db2d57e33d613328a4f163/pkg/controller/container-runtime-config/container_runtime_config_controller.go#L794

It sounds like in some cases it fails to read the cluster ICSP (maybe the bootstrap MCC logs would show something, maybe not), so it doesn't generate any MCs (which might not be an error if there is no cluster ICSP in the templates dir yet for some reason), and the cluster continues the bootstrap process.

When the cluster-level MCO comes up, it then goes through the syncs again, including syncing the cluster MCS and MCC (MCS comes first). Now it sounds like the ICSP change could have taken effect at any point between the bootstrap MCC finishing and the cluster MCC coming up. Right?

There's a caveat here: the cluster MCs you see when you do `oc get mc` should all be generated by the cluster MCC. That means that if the ICSP change took effect in between, the cluster MCC should have generated only the new one; instead of 2 MCs you would get only 1, and then when the MCD runs, you should get the cryptic "rendered-worker-xxx not found". Now, I've only seen this happen to master nodes in the past, so I'm not 100% sure I'm correct in that assessment.

Basically, what this means is that the actual ICSP change that the CVO should have applied to the cluster long ago gets applied after the cluster MCC has started running and synced at least once. That doesn't sound right since other configs that come from templates are all correct, and unless the CVO differentiates how it applies the two, the ICSP change should not have been applied that late.

Also, are the ICSP changes you're applying worker only? Do masters also have that change? If so, the other question here would be why the master configs get applied correctly when they both should be going through the same flow.
We have concerns about this landing in 4.7. We'd like to discuss designs for 4.8.
We're still investigating this bug. Stay tuned.
*** Bug 1874818 has been marked as a duplicate of this bug. ***
*** Bug 1945863 has been marked as a duplicate of this bug. ***
Is there a reason the priority/severity for this bug was upgraded? We plan on fixing this bug but would like to wait until after feature freeze to do so.
Lowering priority until we understand the problem clearly.
Will review again for a future sprint.
Can the machine config operator team review this bug? From what I can see, there does not appear to be an issue on the installer side. This could also potentially belong to the CVO, but I think we need to rule out an MCO bug first.

This BZ is listed as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1874818, which has a reproducer. This original BZ does not have good logs associated with it.

Richa, can we get a log bundle from an affected cluster?
Wait, I thought I'd pinned down the reason in comment 2, with [1] up to keep the CVO from racing with the cluster-update script to get these resources into the cluster. Is there more work I need to do to demonstrate that as the issue, or sell folks on my proposed mitigation?

[1]: https://github.com/openshift/enhancements/pull/477
https://bugzilla.redhat.com/show_bug.cgi?id=1899750#c3 raised questions that I thought should be checked out by the MCO, particularly:

* Considering the bootstrap MCC renders an initial worker machineconfig along with the master machineconfig, why does the worker lack the ICSP but the master doesn't?
* Why are other machineconfigs which depend upon similar template data rendered correctly? That is, the question suggested by this statement: "That doesn't sound right since other configs that come from templates are all correct, and unless the CVO differentiates how it applies the two, the icsp change should not have been applied that late."

I don't have much knowledge of this area beyond what I have read in this BZ. I don't think this belongs to the installer (correct me if I'm wrong), but it could be either CVO or MCO. Based on my reading of the BZ, I think the MCO should rule out bugs before this moves to the CVO, where the solution would be adding functionality (rather than correcting a bug).