Bug 2098299 - install-config: Strict unmarshalling conflicts with new fields
Summary: install-config: Strict unmarshalling conflicts with new fields
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.11
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: 4.12.0
Assignee: Aditya Narayanaswamy
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks: 2105331
TreeView+ depends on / blocked
 
Reported: 2022-06-17 22:41 UTC by Eric Fried
Modified: 2023-01-17 19:50 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, cluster installations using Hive could fail if Hive used an older version of the install-config.yaml file. This update allows the installer to accept older versions of the install-config.yaml file provided by Hive. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2098299[*BZ#2098299*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:50:08 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 6039 0 None open Bug 2098299: Switch to perform normal marshalling with unknown fields 2022-06-24 14:03:41 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:50:25 UTC

Description Eric Fried 2022-06-17 22:41:51 UTC
Thanks for opening a bug report!
Before hitting the button, please fill in as much of the template below as you can.
If you leave out information, it's harder to help you.
Be ready for follow-up questions, and please respond in a timely manner.
If we can't reproduce a bug we might close your issue.
If we're wrong, PLEASE feel free to reopen it and explain why.

Version:

$ openshift-install version
vendored dep: 4.11 pre-GA
installer binary: 4.10

Platform:
AWS

Please specify:
IPI (via hive)

What happened?
Hive needs to vendor the installer to do things like generate machinepool objects in the install-config [1]. It then invokes the installer binary for whatever OCP version is to be installed. We can only vendor one version of the installer. Recently we bumped our dependency [2] to bdbb3b1 to pick up a new field for azure. That PR merged because our CI was still testing against 4.9, which still uses lenient unmarshaling of the install-config [3]. Then we bumped our CI to use 4.10 [4] and started failing with

```
level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to unmarshal install-config.yaml: failed to parse first occurence of unknown field: error unmarshaling JSON: while decoding JSON: json: unknown field "metadataService"
```

This is happening due to the combination of 4.10 having made unmarshaling strict [5][6] and the recent (pre-GA, but before bdbb3b1) addition of that field to the AWS MachinePool struct [7]:
Hive constructs the object (without that field) [1] and marshals it into the install-config [8]. But since the new field in our vendored version of the installer doesn't have `omitempty`, the marshaler is adding it with its nil value. This install-config gets to the 4.10 installer, which *has* strict unmarshaling but *lacks* that field, and kaboom.

We'd like some more time to think about it, but we think the proper solution may simply be to add `omitempty` to this field (and any other that's added to the same GVK of an object).

[1] https://github.com/openshift/hive/blob/f9e6e206d5c6d5f50282ea09d4711edc8ad045c5/pkg/clusterresource/aws.go#L118-L129
[2] https://github.com/openshift/hive/pull/1764
[3] https://github.com/openshift/installer/blob/release-4.9/pkg/asset/installconfig/installconfig.go#L126
[4] https://github.com/openshift/release/pull/29603
[5] https://github.com/openshift/installer/pull/5307
[6] https://github.com/openshift/installer/blob/release-4.10/pkg/asset/installconfig/installconfig.go#L128
[7] https://github.com/openshift/installer/pull/5793/files#diff-ebf1b2fb6a5229b36ee6bffb2348a61c75b4c77a30d89f5b3e510015ffe908c7R31
[8] https://github.com/openshift/hive/blob/f9e6e206d5c6d5f50282ea09d4711edc8ad045c5/pkg/clusterresource/builder.go#L404

How to reproduce it (as minimally and precisely as possible)?
See CI results: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_hive/1799/pull-ci-openshift-hive-master-e2e/1537921903648313344#1:build-log.txt%3A336

Comment 1 Eric Fried 2022-06-17 23:00:57 UTC
> add `omitempty` to this field

I've submitted https://github.com/openshift/installer/pull/6028 to do that. (It's not linking to this bug, which probably means I spelled things wrong -- I don't work with BZ much.)

Comment 2 Eric Fried 2022-06-22 16:03:03 UTC
We're discussing this in slack [1]. In summary:
- Adding `omitempty` is not sufficient to address the problem -- see [2]. The new fields must *also* be converted to pointers. I have updated https://github.com/openshift/installer/pull/6028 to do that for the EC2Metadata field. However...
- There are other fields for which this problem manifests. We don't know all of them. But at least Azure OSImage [3] is affected. (Note that this one is already `omitempty`, but is still a problem because it's not a pointer.) I've added a commit for this field to the PR.
- It is under consideration to simply revert the switch to UnmarshalStrict [4]. This would be the simplest solution, and guaranteed (?) to cover all cases, since we wouldn't need to root out individual fields.
- Alternatively, it would be lovely to have unit tests that detect this class of problem so we can a) identify existing instances, and b) fail CI, preventing future violations. The challenge here is that such a test needs to operate across releases/commits, at least to address the former. For the latter, it may be possible to simply look for new fields and enforce that they're pointers with `omitempty`. Though that gets complicated for sub-structures, whose contents don't necessarily want/need to conform to that rule.

[1] https://coreos.slack.com/archives/C68TNFWA2/p1655842108884639
[2] https://github.com/golang/go/issues/11939
[3] https://github.com/openshift/installer/blob/599c160a4e1c48ded0980955bc475ece0268083f/pkg/types/azure/machinepool.go#L45
[4] https://github.com/openshift/installer/pull/5307

Comment 6 Yunfei Jiang 2022-07-12 02:09:10 UTC
The target release of this BZ is 4.11, but linked PR https://github.com/openshift/installer/pull/6039 is on master branch (4.12).

Comment 7 Patrick Dillon 2022-07-22 14:25:02 UTC
Updated target release to 4.12 and moved this back to ON_QA.

Comment 9 Yunfei Jiang 2022-07-29 07:54:46 UTC
Tested against serval unknown fields in install-config.yaml, only one of them was printed in warning message.
I expect all unknown fields printed in warning message.

OCP Version 4.12.0-0.nightly-2022-07-27-133042

> tested with following config:

apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      type1: aaa
      rootVolume:
        aaaa: 123
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      type1: aaa
      rootVolume1:
        aaaa: 123
  replicas: 3
metadata:
  creationTimestamp: null
  name: yunjiang-test1
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  a:
    b: 123
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-2
publish: External


> output:

WARNING failed to parse first occurence of unknown field: failed to unmarshal install-config.yaml: error unmarshaling JSON: while decoding JSON: json: unknown field "aaaa"
INFO Attempting to unmarshal while ignoring unknown keys because strict unmarshaling failed
INFO Credentials loaded from the "de" profile in file "/home/cloud-user/.aws/credentials"
INFO Consuming Install Config from target directory
INFO Manifests created in: bb/manifests and bb/openshift

Comment 10 Aditya Narayanaswamy 2022-08-17 13:18:00 UTC
This is a known disadvantage to the golang library where only the first unknown field is printed. There seems to be no solution for this at the moment and we agreed to do it this way knowing the pitfall.

Comment 11 Patrick Dillon 2022-08-18 00:11:17 UTC
Moving this back to ON_QA due to https://bugzilla.redhat.com/show_bug.cgi?id=2098299#c10

Yunfei, thanks for checking this so thoroughly. Based on Aditya's comment, I think this is a limitation we should accept and is ultimately insignificant. Let us know if we need to do anything else here.

Comment 12 Yunfei Jiang 2022-08-18 08:49:07 UTC
The limitation is acceptable, moving to VERIFIED.

Comment 15 errata-xmlrpc 2023-01-17 19:50:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399


Note You need to log in before you can comment on or make changes to this bug.