Bug 1947477 - 4.7->4.6 rollback fails due to MCO requiring new ignition spec "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
Summary: 4.7->4.6 rollback fails due to MCO requiring new ignition spec "Failed to ren...
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: ---
Assignee: MCO Bug Bot
QA Contact: Rio Liu
URL:
Whiteboard:
: 1975975 (view as bug list)
Depends On: 1940207 1950261
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-04-08 14:40 UTC by Federico Paolinelli
Modified: 2022-06-30 23:03 UTC (History)
20 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1940207
Environment:
job=periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback=all
Last Closed: 2021-05-05 20:50:59 UTC
Target Upstream Version:
mkrejci: needinfo-


Attachments (Terms of Use)

Comment 1 Federico Paolinelli 2021-04-08 14:41:53 UTC
Cloned this to track the MCO side described here https://bugzilla.redhat.com/show_bug.cgi?id=1940207#c5 and to leave 1940207 on the network bug

Comment 2 Federico Paolinelli 2021-04-08 14:46:06 UTC
Adding more context: the original bz was related to a downgrade issue on CNO. While fixing it, I saw the rollback did not finish due to:

    lastSyncError: 'pool master has not progressed to latest configuration: controller version mismatch for rendered-master-cb2db7df54e993c796b76a2242b3e08a expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af has b5723620cfe40e2e4e8cbdcb105d6ae534be1753: pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1", retrying'
    master: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool master: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'
    worker: 'pool is degraded because rendering fails with "": "Failed to render configuration for pool worker: parsing Ignition config failed: unknown version. Supported spec versions: 2.2, 3.0, 3.1"'

There is a discussion on the original bz whether MCO should support or not rollbacks, but I'd like to keep the MCO and CNO issues separated so we don't loose track.

Comment 3 Michelle Krejci 2021-04-13 22:16:23 UTC
The MCO team, in discussion with @mrussell and @acrawfor, does not regard rollbacks as a bug. It may be a feature that we want to support. Given the level of discussion on the original bug, we can discuss further at the Platform and Lifecycle Architecture meeting. Pending the outcome of that meeting, I will close this bug.

Comment 4 Michelle Krejci 2021-04-26 18:28:04 UTC
Currently, we state (to customers) that we don’t support rollbacks: “Reverting your cluster to a previous version, or a rollback, is not supported. Only upgrading to a newer version is supported." https://docs.openshift.com/container-platform/4.7/updating/understanding-the-update-service.html#update-service-overview_understanding-the-update-service 

This was also discussed in greater detail at the Lifecycle and Pillar meeting April 22, 2021 https://docs.google.com/document/d/1QrvzW0QEftUMjNBCSkzG1m4yO9tNnTn3SHuoU5hiUMw/edit#heading=h.qcyk8b39hg2h. Closing for now since this is not currently supported.

Comment 6 Clayton Coleman 2021-04-30 02:40:28 UTC
Can I get more human readable description of what the actual bug in MCO is:

We upgrade to 99 in CVO to 4.7 from 4.6 (so new MCO, new nodes).  We then start a rollback, which means we go from beginning of payload to end on 4.6.  So we should be applying apiserver, network, etc then mco. MCO is deployed to 4.6 (nodes are still 4.7).  

At this point, MCO/MCS is serving some older spec version???  

^ fill in the explanation here in something a non-MCO-guru can understand (i.e. component X at 4.6 can't work with component Y at 4.7, etc).

Comment 7 W. Trevor King 2021-04-30 05:23:10 UTC
Machine-config server should only come in for new nodes, which isn't the issue here.  I dunno if anyone actually linked to gathered assets from the run that spawned this bug [2] (maybe wasn't even in the CI account).  But Jerry points out [3]:

> The error is expected. The MCO doesn't support downgrades, so the 4.6 MCO doesn't understand how to parse ignition 3.2 configs (4.7). This in turn means unfortunately all downgrades from 4.7->4.6 will fail

So if I'm understanding, the fix would be "teach the 4.6 MCO to understand 3.2 Ignition configs and down-convert when it sees them, mumble mumble anything which cannot be down-converted".  However, looking at [4,5]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1387886270079832064/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
  2021-04-29T23:34:37Z - Partial 4.6.27
  2021-04-29T22:32:41Z 2021-04-29T23:34:37Z Partial 4.7.0-0.ci-2021-04-29-142719
  2021-04-29T22:01:14Z 2021-04-29T22:29:21Z Completed 4.6.27
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1387886270079832064/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-04-29T22:29:21Z Available=True : Done applying 4.6.27
  2021-04-30T01:12:02Z Failing=True ClusterOperatorDegraded: Cluster operator ingress is reporting a failure: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
  2021-04-29T22:32:41Z Progressing=True ClusterOperatorDegraded: Unable to apply 4.6.27: the cluster operator ingress is degraded
  2021-04-29T22:01:14Z RetrievedUpdates=False NoChannel: The update channel has not been configured.

So probably not worth sinking time into the MCO vs. Ignition spec issues until that earlier hang-up gets sorted.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1387886269920448512
[2]: https://github.com/openshift/machine-config-operator/pull/2506#issuecomment-814168869
[3]: https://github.com/openshift/machine-config-operator/pull/2506#issuecomment-815024614
[4]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback
[5]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1387886270079832064

Comment 8 Yu Qi Zhang 2021-05-03 23:49:13 UTC
Like Trevor says, the fundamental issue is basically that different versions of MCO support different versions of ignition spec'ed configs. A breakdown

4.5: Ignition spec v2.2, 3.0 (sort of)
4.6: Ignition spec v2.2, 3.0, 3.1
4.7: Ignition spec v2.2, 3.0, 3.1, 3.2
4.8: same as 4.7

Correspondingly the version of MCO that gets deployed generates a rendered config in the newest version, so 

4.5: rendered config on 2.2
4.6: rendered config on 3.1
4.7: rendered config on 3.2
4.8: same as 4.7

So this means a rendered config created by the 4.7 MCO cannot be read by the 4.6 MCO, because newer ignition versions can contain spec fields that the older one does not have. This is especially the case for the 4.5->4.6 update where we went from spec 2 to spec 3, and that is basically a one-way trip, since the major version bump is a complete overhaul.

This is fundamentally where the error is coming from: when rolling back the 4.6 MCO rolls out, sees that the nodes have updated to some "3.2" config, tries to understand it, and cannot. So it cannot validate if the nodes are in a good state or not to perform the downgrade, and thus it fails.

So then, the request here is to: in any future version of Openshift where we update the ignition spec version, we must also backport it as a feature to the previous y release. In this case, that would mean we need to backport 3.2 support (which was added in 4.7) to 4.6.

I personally feel that this is a feature request for 4.8+ (framing: MCO backports ignition spec bumps to 4.y-1 for 2-way compatibility always). I think that in itself should be 99.9% good for downgrades. For example, based on the above, 4.8 today should not fail the rollback to 4.7. As for 4.7->4.6 and 4.6->4.5, I am leaning towards marking those as not bugs. Feel free to tell me if the above assessment is wrong in any way

Comment 9 W. Trevor King 2021-05-03 23:57:06 UTC
> So then, the request here is to: in any future version of Openshift where we update the ignition spec version, we must also backport it as a feature to the previous y release.

Alternative, although forward-looking-only, approach would be to teach the MCO to write a version that the previous minor understands.  If 4.7 had written 3.1, 4.7->4.6 rollbacks would have been fine.  And if 4.6 wrote 2.2, 4.6->4.5 rollbacks would have been fine.  I dunno how much effort it is to fix existing releases via backports, but if, say, 4.9 learns how to read 3.3 (or whatever) but keeps the default at 3.2 for a minor, we'll avoid getting into this situation with 4.9 -> 4.8 rollbacks.

Comment 10 Yu Qi Zhang 2021-05-04 00:28:32 UTC
> Alternative, although forward-looking-only, approach would be to teach the MCO to write a version that the previous minor understands.

That does get into another issue. For example, the 4.7 spec 3.2 bump was to support LUKS encrypted storage in 4.7, which as I understand was not in 3.1, so I don't think that's feasible

Comment 11 W. Trevor King 2021-05-04 00:57:13 UTC
Looks like 3.2 was stabilized 2020-10-13 [1].  4.6.0 was 2020-10-21 [2].  The bump to 3.2 landed in the MCO on 2020-12-04 [3].  I agree that if we need a feature that's so young, backporting an ability to read the new spec version to the older minor's MCO is our only option.  But hopefully Ignition will get out in front of us a bit further, and we can auto bump an understanding of the new spec versions before we need to consume the features they add.  Any ideas when Ignition 3.3 is due to be stabilized?

[1]: https://github.com/coreos/ignition/pull/1103#event-3874509356
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.6.0
[3]: https://github.com/openshift/machine-config-operator/pull/2248#event-4072173071

Comment 12 Benjamin Gilbert 2021-05-04 02:36:05 UTC
Ignition spec 3.3 is expected for OCP 4.9, and we usually stabilize the spec fairly late in the OCP development cycle in case last-minute issues pop up.

So far, spec stabilizations have been driven by OCP needs, and it wouldn't have been practical to stabilize a spec one OCP release before it was needed.  Also, we prefer to develop all the cooperating pieces of code together (Ignition, RHCOS support glue, Butane transpilation, client code), since that helps flush out any design flaws before we commit to a stable config spec with particular semantics.  As a result, for Ignition to lead OCP by one release, we'd essentially need to develop an entire feature set, land it, and then tell users not to use it for one cycle.

As an alternative, ign-converter <https://github.com/coreos/ign-converter/> may be able to help here.  It already has the ability to downconvert an Ignition config to an earlier spec version, or fail if the input config can't be expressed in that spec.  If we backported a newer converter to the previous MCO release we should be able to get the correct semantics: automatically downgrade a config if the cluster hasn't started using newer features yet, and fail with a clear explanation if it has.

Comment 13 W. Trevor King 2021-05-04 03:52:52 UTC
> If we backported a newer converter to the previous MCO release we should be able to get the correct semantics: automatically downgrade a config if the cluster hasn't started using newer features yet, and fail with a clear explanation if it has.

But if you defer the bump until you need the new feature, you'll break rollbacks once the new MCO comes around and starts upconverting Ignition config spec versions, right?  Unless the downconverter can say "I can't express $NEW_STUFF in $OLD_SPEC_VERSION, but I can certainly express which files will need to be ripped up off the disk to unroll this MachineConfig".  Because all we need is enough information for the incoming, older-version MCO to be able to roll back to an older-version MachineConfig.

> Also, we prefer to develop all the cooperating pieces of code together (Ignition, RHCOS support glue, Butane transpilation, client code), since that helps flush out any design flaws before we commit to a stable config spec with particular semantics.

I didn't notice any 3.2-alpha PRs landing in the MCO repo, but I could certainly have missed them.  Or does the combined development mostly happen in an in-flight MCO PR?  I went back through [1], and I didn't notice any references off to parallel Ignition PRs, RHCOS merge requests, etc.  I understand that you want to know that a plan will work before you commit to it in a branch that will eventually end up in production, but it's not clear to me yet why Ignition config spec bumps have to be forward incompatible.

[1]: https://github.com/openshift/machine-config-operator/pull/2248

Comment 14 Benjamin Gilbert 2021-05-04 04:41:54 UTC
>> If we backported a newer converter to the previous MCO release we should be able to get the correct semantics: automatically downgrade a config if the cluster hasn't started using newer features yet, and fail with a clear explanation if it has.
> But if you defer the bump until you need the new feature, you'll break rollbacks once the new MCO comes around and starts upconverting Ignition config spec versions, right?

Not generally.  New Ignition features, so far, have been conditional to particular use cases.  For example, if you don't need partition resizing, LUKS, gs:// URLs, or boot disk RAID, it's perfectly fine to continue using 3.1.0 configs.

If we don't want to deal with downconverting configs in the previous release (and thus backporting code), we may be able to be more selective about using newer config versions in the current release.  I believe the MCO currently always uses the current stable spec, since that lets it avoid introspecting the features needed by a particular config.  But it could probably just e.g. generate 3.2.0 configs, try downconverting them to 3.1.0, and if that succeeds, serve them as 3.1.0 instead.  I may be missing something though.

> Unless the downconverter can say "I can't express $NEW_STUFF in $OLD_SPEC_VERSION, but I can certainly express which files will need to be ripped up off the disk to unroll this MachineConfig".

The functionality of the files stage (files, systemd units, passwd) has been fairly stable, actually.  The new features have been more on the disk partitioning/formatting side, which the MCO doesn't reconcile at runtime anyway.  Also, upgrades/downgrades don't currently matter for day-1-only functionality, since upgraded clusters stay with their original bootimage and thus their original Ignition version.  Net result, "which files will need to be ripped up" isn't really a thing.

> I didn't notice any 3.2-alpha PRs landing in the MCO repo, but I could certainly have missed them.  Or does the combined development mostly happen in an in-flight MCO PR?

The MCO changes land pretty late, currently.  That part isn't great and we'd like to improve it.  We do a better job at the OS integration level.

Every Ignition release includes a WIP copy of the next stable spec, with no stability guarantees.  The recent 2.10.1 release includes probably more than half of the new functionality that will end up stabilizing as spec 3.3.0.  It can be invoked by writing a config with version "3.3.0-experimental".  Crucially, 3.3.0-experimental configs will no longer be accepted by newer Ignition releases after 3.3.0 is stabilized.

Butane uses a similar versioning trick, and the Dracut glue in fedora-coreos-config and openshift/os generally keeps pace with the latest Ignition release.  Those are the packages where co-development is most visibly helping us right now.  It's actually been pretty common to have to rethink Ignition functionality as we get experience with how it integrates into the OS.

Comment 15 W. Trevor King 2021-05-04 18:52:41 UTC
> Not generally.  New Ignition features, so far, have been conditional to particular use cases... The new features have been more on the disk partitioning/formatting side, which the MCO doesn't reconcile at runtime anyway.

Oh, nice :).  Seems like either downconverting or introspecting features and picking the minimal required spec version would work for most rollbacks, and I have no opinions on which of those the MCO folks feel would be less work.  But can we pick one, or some other alternative, and start doing that the next time we bump the Ignition spec?  I'll leave it to other folks to decide if it's worth it to try and green up 4.7 -> 4.6, but I'd like 4.9 -> 4.8, etc., to not get stuck on something similar.

Comment 16 Yu Qi Zhang 2021-05-05 00:31:10 UTC
I think we can forward look to enable 4.9->4.8. One possibility to prevent the need of backporting in the future, is if the MCD sees a 3.x+1 config of its current support (e.g. 3.3 in 4.9, but it allows a 3.4 config), it just interprets it as a 3.3 config.

This should be ok since any new functionality added to 3.4 shouldn't be used during upgrades anyways, and new installs shouldn't be able to go to a previous version.

Comment 17 Yu Qi Zhang 2021-05-05 20:50:59 UTC
After some further considerations I am going to close this as NOTABUG again, and revert my position on the MCO support of rollbacks on y-stream downgrades (which is what this bug is). There is no point in doing so unless we have underlying RHCOS and RHEL support, and e.g. if RHCOS or RHEL went from RHEL 8.3->8.4 during a y-stream, there is no support as far as I'm aware for downgrading to 8.3 again.

Thus once the nodes have begun the upgrade process, the MCO cannot confidently say you will be able to rollback. Thus there is no reason for us to consider this support. We can reconsider this once the higher level discussions for RHCOS happens.

Comment 18 Clayton Coleman 2021-05-06 14:29:41 UTC
I have reopened this, please do not close again without my approval.

> 4.5: Ignition spec v2.2, 3.0 (sort of)
4.6: Ignition spec v2.2, 3.0, 3.1
4.7: Ignition spec v2.2, 3.0, 3.1, 3.2
4.8: same as 4.7

> Correspondingly the version of MCO that gets deployed generates a rendered config in the newest version, so 

4.5: rendered config on 2.2
4.6: rendered config on 3.1
4.7: rendered config on 3.2
4.8: same as 4.7


> Alternative, although forward-looking-only, approach would be to teach the MCO to write a version that the previous minor understands.  If 4.7 had written 3.1, 4.7->4.6 rollbacks would have been fine.  And if 4.6 wrote 2.2, 4.6->4.5 rollbacks would have been fine.  I dunno how much effort it is to fix existing releases via backports, but if, say, 4.9 learns how to read 3.3 (or whatever) but keeps the default at 3.2 for a minor, we'll avoid getting into this situation with 4.9 -> 4.8 rollbacks.

This is not correct.  We should NEVER require the use of a new API version until ALL supported components have been upgraded to a new version.  Effectively what MCO *MUST* do (this is not an option) is use a rendered config the previous Y version of openshift understands until the *next* Y.  We do this for kube and internal etcd, and we should be doing it for ignition.

To describe what SHOULD have been done in rendering config:

4.5: rendered config on 2.2
4.6: rendered config on 2.2 (because 4.5 didn't support 3.0 really)
4.7: rendered config on 3.1 (because 4.6 didn't support 3.2)
4.8: rendered config on 3.2 (because 4.7 supports 3.2)

When doing N-1 compatibility you need to remain fully compatible throughout the whole upgrade cycle by upgrading control plane first (this has been how kube and OCP have worked).

So the bug here is that (i think?) MCO is too aggressively switching to rendered config, and instead has to do those transitions ONLY on minors when new versions support everything.  It's still fine during CI and everything to test the newer verisons (probably as an optional job), but it is NOT safe to start requiring a new config version until it has been live for at least one Y release.

Comment 19 Clayton Coleman 2021-05-06 14:32:30 UTC
Backporting the required ignition changes is probably not a good plan since that introduces a lot of risk in the previous z (is an API change?) and the safest path is for MCO to follow the same versioning and API rules that control plane, kubelet, etc all do (we do the same thing with CNI/CRI/CSI).  I know we have a better doc upstream than https://kubernetes.io/docs/setup/release/version-skew-policy/ that describes WHY we have this policy, but we have considered this policy authoritative for OCP since 3.6 or so and it applies to all components and all APIs that must support the openshift version skew rules.

Comment 20 Steve Milner 2021-05-06 21:19:45 UTC
There are a few things here in this BZ I'd like to unpack:

1. Rollbacks
As noted in previous comments this isn't something that is currently supported, though we do have a test for it. Adding support for rollbacks is a fair enhancement request. If we want to support this direction let's get an RFC and/or enhancement. Adding support for this will probably involve multiple teams and require design.

2. Config update control
Fair point. As Benjamin noted above, and Clayton did later, new features that Ignition adds in a spec could wait a release so we could ensure N-1. While most of the spec bumps have been additive for new features it may make sense for us to slow down and batch a bit more when it comes to Ignition/MachineConfig in releases.

Comment 21 Aravindh Puthiyaparambil 2021-05-07 16:04:25 UTC
@ccoleman is there a reason for the bug to be in POST? POST implies a fix is in the works which is not the case. I am moving it to NEW to reflect the accurate state.

Comment 23 W. Trevor King 2021-05-07 21:33:29 UTC
I'm pretty sure POST was supposed to be NEW in comment 18.

Comment 24 Clayton Coleman 2021-05-10 15:28:30 UTC
Attempting to talk through the various issues here, I want to start at the key point that sometimes is confused with rollbacks (we have not asked teams to support rollbacks, but we DO require the following):

"Does MCO, during a Y-upgrade, ever put itself in a spot where it cannot function properly?"

It is required (and would be a release blocking bug) that all components of OpenShift remain functional during the upgrade process (y or z). That means to a user there is either no disruption (a goal for our API endpoints) or at most a few seconds of disruption (for controller driven processes that are not directly hit) from a user's perspective. To achieve that, we set up certain patterns and models (what I referred to from kube above, but not well documented in general):

1. We run our API services HA, and make sure that API changes are always forwards compatible during ANY upgrade (so an old client, talking to the new version behind a load balancer, sees no behavior change if they are suddenly connected back to the old version)
2. We generally do not deprecate or remove APIs, and when we do we gate upgrades that might block them on ensuring no old client can still be talking to the version about to be removed
3. Controllers could be connected to an arbitrary API server during upgrade (old or new) so they must be written and tested as if they could be regressed during an upgrade (upgrades are not one way)
4. If we have to make breaking format changes, we do that *after* the upgrade via a separate process that is initiated by a user (for instance, our migration from etcd2 -> 3 involved a breaking change and was required to be done after upgrading to 3.6 but before upgrading to 3.7, so 3.7 simply required that the format change was already in place)
5. We make minimal / no API / incompatible changes during z streams in order to simplify reasoning about the safety of z streams.

We test these in a variety of ways (and are trying to always improve them), and one of those tests is the rollback test because done properly (for the kube control plane at least) any implications of 1-5 are automatically hit by rollback tests in a way that causes detectable failures.  That assumption *does not* always hold for other components, which is why I was focused on how your use of ignition (a client facing API provided by your component) behaves during upgrades and whether it indicates we might not satisfy the "keeps working" invariant.  So that's the first thing to check from this bug.  For example, only new machines are ignited then your spec bump is ok - but if they aren't you would be "functionally unavailable" and thus this would be an urgent bug.

Secondly, Ben's request here was at my request because the failure of MCO to rollback blocks other components ability (the control plane) to test these critical assumptions as we add z streams to both releases.  If you have broken 4.7 to 4.6 rollback tests, we cannot keep verifying that our EUS *forward* upgrade is still safe, which is of massive importance for EUS stream.

Next, the other constraint for upgrades is:

"Is MCO sufficiently tested so that the MCO team is confident that if the upgrade process is stopped or disrupted AT ANY POINT that the MCO remains available and functional for users"

The other goal of rollback tests (as described above) is that they simulate one class of problems that can occur due to stopped upgrades.  The CVO could fail or deadlock at any point, or during a deployment rollout the new pod could be killed at any point and the old pod resurrected. A team that isn't aggressively testing for that themselves is implicitly leveraging the rollback tests and our high level detection to find those issues.  Since having each team have the expertise to debug these sorts of problems does not necessarily scale, rollback and other tests are our vehicle for simulating problems - when we hit the issues I'm describing in a general way we can identify the assumptions / bugs that led to that and guide teams to fix without each team having to build their own tests (or lots of their own tests). If the MCO team believes that they are 100% available during their entire upgrade process even if ANY part of the upgrade process hangs arbitarily long (i.e. your new CRD is rolled out but not your operator, your operator gets disrupted so the old code starts running AFTER the new code is rolled out) then this is less of an issue for them.

An example about this causing you to fail with respect to ignition spec is: "if you require new ignition for booting machines, and that assumes that you're using new RHCOS, are you aware that the osimageurl is updated AFTER your operator rolls out, so if CVO hangs before it updates osimageurl you can't boot new machines?"  That would be another reason this would be an urgent bug, which rollback is intended to help verify (although it's not optimally efficient)

However, like the previous test, the failure of 4.6 to 4.7 rollback now leaves the entire platform (other componetns) unable to exercise those simulations, which means that MCO has regressed our ability to catch issues of this sort in 4.8.

So, the three asks are:

1. confirm that you are functionally available during upgrade
2. confirm that you remaining functionally available if any part of the process fails
3. help us fix 4.7 to 4.6 rollback with a workaround for MCO because otherwise we have regressed ALL test coverage of upgrade safety on our first EUS step which is critical to our ability to deliver EUS.

Comment 25 Colin Walters 2021-05-10 18:42:41 UTC
This is a semi-aside but: We had a chat around this and I think on the OS side what we need to support ideally is "Keep running with latest OS but rollback to specific N-1 kernel".  Because 95% of the issues on an upgrade that *could* be fixed by rollback are going to be the kernel.
That could be some sort of MCO feature, even something as streamlined as:

```
operatingSystemHotfix:
  kernel: 4.6
```

To implement that the MCO would walk the CVO upgrade history, find the last release image that matched 4.6, pull its machine-os-content, pull the kernel out of that and apply it to the nodes.

Comment 26 Yu Qi Zhang 2021-05-11 03:50:35 UTC
A few things to note:

> 2. We generally do not deprecate or remove APIs, and when we do we gate upgrades that might block them on ensuring no old client can still be talking to the version about to be removed

The MCO never fully deprecated any API in the ignition spec. The newest MCO has support all the way back to 4.1 generated configs.

> 3. Controllers could be connected to an arbitrary API server during upgrade (old or new) so they must be written and tested as if they could be regressed during an upgrade (upgrades are not one way)

The "upgrades are not one way" is what concerns me, since I was under the impression that the whole point of upgrades and graphs is that they are "one way" today. The new MCO controller, again, understands old configs always

> 4. If we have to make breaking format changes, we do that *after* the upgrade via a separate process that is initiated by a user

Using this as an easier to highlight example, this wasn't really done in the MCO up to now (and correspondingly RHCOS) for at least the major ignition spec bump we performed during the 4.5->4.6 timeframe. The ignition spec bump from 2.x to 3.x (which is much more complex than the 3.1->3.2 in this bug) was done automatically. Meaning that if you had somehow made older definitions we did not support, the upgrade would break (but only for rare scenarios that I don't think ever manifested, which is good).

This is partially why I wanted to frame this as a forward looking feature: to design upgrades in the MCO such that we are able to perform format breaking changes with guarantees.

> For example, only new machines are ignited then your spec bump is ok - but if they aren't you would be "functionally unavailable" and thus this would be an urgent bug.

and

> "if you require new ignition for booting machines, and that assumes that you're using new RHCOS, are you aware that the osimageurl is updated AFTER your operator rolls out, so if CVO hangs before it updates osimageurl you can't boot new machines?" 

Fortunately newly ignited machines do not see this issue. The machine-config-server always serves spec versions based on the incoming ignition binary request version, so the MCS is able to on the fly serve all supported ignition versions up to the newest one in that MCO version. This is required since we do not bump bootimages by default today, so a 4.7 cluster may still be using 4.4 bootimages, thus the compatibility will be there for the foreseeable future.


> "Is MCO sufficiently tested so that the MCO team is confident that if the upgrade process is stopped or disrupted AT ANY POINT that the MCO remains available and functional for users"

Just my personal perspective: the MCO itself is relatively resilient in the upgrade process, in the sense that the new/old MCO components are able to intercommunicate provided that they are in the one-way upgrade path. The issue (e.g. this bug) only occurs if the rollout order is not observed, e.g. an old daemonset attempting to manage an updated node (daemonset should roll out before nodes are updated).

> 1. confirm that you are functionally available during upgrade

This should not have changed in the general sense for the MCO (barring some bugs in the new version perhaps)

> 2. confirm that you remaining functionally available if any part of the process fails

We should return the error as noted, but otherwise the MCO is able to run still during failure scenarios presented.

> 3. help us fix 4.7 to 4.6 rollback with a workaround for MCO because otherwise we have regressed ALL test coverage of upgrade safety on our first EUS step which is critical to our ability to deliver EUS.

This really is the crux of the discussion. As I see it one option moving forward with the MCO (e.g. spec 3.3 in 4.9) would be the MCO (or ignition converter), when generating the rendered config (which is what's causing the issue in this bug), chain parsed it such that it renders it in the oldest supported version of the spec.

This means that, for example, all specs would render to 3.1 unless it requires a config snippet only supported in 3.2, etc., such that new installs can still take advantage of the feature, but upgrades don't immediately jump to the new version to facilitate a rollback test like this.

Now for this bug specifically, this probably would not be recommended since 4.7 has been released, so we'd have upgraded customers from 3.1 to 3.2 back to 3.1. So our best alternative right now is either to 
1. backport full 3.2 support to 4.6
2. make an exception of dummy support of 3.2 in 4.6 via parser hacks

I think generally speaking we'd still like to position ourselves to not backport features. The MCO has not operated under the assumptions above (perhaps mistakenly) up until now, so for the MCO, it would be a feature since we've never designed it that way, although it may be considered more of a bug from the general platform perspective.

Sorry for the text dump, let me know if that doesn't make sense, and thanks for the overall context.

Comment 27 W. Trevor King 2021-05-11 04:11:53 UTC
> The "upgrades are not one way" is what concerns me, since I was under the impression that the whole point of upgrades and graphs is that they are "one way" today. The new MCO controller, again, understands old configs always

Here, the incompat is between the outgoing MachineConfig controller and the incoming MachineConfig controller, right?  Checking 4.8.0-rc.3 update CI:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1390739944372178944/artifacts/e2e-aws-upgrade/deployments.json | jq -r '.items[] | select(.metadata.name == "machine-config-controller").spec | {replicas, strategy}'
  {
    "replicas": 1,
    "strategy": {
      "rollingUpdate": {
        "maxSurge": "25%",
        "maxUnavailable": "25%"
      },
      "type": "RollingUpdate"
    }
  }

So the update risk would be something like:

1. You surge in a new controller pod.
2. Outgoing pod is terminated and release its leader lease.
3. Incoming pod acquires the leader lease and updates to the new Ignition spec.
4. Incoming pod has some kind of disaster and dies.
5. Outgoing pod re-acquires the leader lease, sees the new, unrecognized Ignition specs, and sticks.
6. Another disaster keeps the Deployment controller from scheduling a replacement controller pod with the new code.

That's two disasters, and (6) in particular seems pretty low risk.  If replicas was larger than 1, I'd be more concerned.

> So our best alternative right now is...

Third option would be to ensure the rollback for 4.6->4.7->4.6 tests always kicks in before the CVO asks the MCO to update.  That conveniently preserves the rollback test for other components, because the bulk of the content that happens after the MCO is PrometheusRule and similar stuff that shouldn't be all that exposed to rollback issues.  That approach wouldn't work for non-MCO components, because if we reversed course before updating them, we'd leave MCO-rollback uncovered.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1390739944372178944

Comment 28 Colin Walters 2021-05-13 16:32:11 UTC
I think the simple way to say this is that the rendered config object in the cluster is entirely contained inside the MCO.  The controller renders it and each pod in the daemonset reads it.  So this fix:

> As I see it one option moving forward with the MCO (e.g. spec 3.3 in 4.9) would be the MCO (or ignition converter), when generating the rendered config (which is what's causing the issue in this bug), chain parsed it such that it renders it in the oldest supported version of the spec.

seems to me by far the simplest and most reliable.

The node (RHCOS) version is mostly irrelevant here because as Jerry noted, the MCS already translates when serving to the node.

Comment 29 Benjamin Gilbert 2021-05-29 00:07:19 UTC
Upstream ign-converter RFE for an function to downconvert a config as far as possible: https://github.com/coreos/ign-converter/issues/22

Comment 30 Yu Qi Zhang 2021-06-24 21:35:06 UTC
*** Bug 1975975 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.