Hide Forgot
Thanks for opening a bug report! Before hitting the button, please fill in as much of the template below as you can. If you leave out information, it's harder to help you. Be ready for follow-up questions, and please respond in a timely manner. If we can't reproduce a bug we might close your issue. If we're wrong, PLEASE feel free to reopen it and explain why. Version: $ openshift-install version vendored dep: 4.11 pre-GA installer binary: 4.10 Platform: AWS Please specify: IPI (via hive) What happened? Hive needs to vendor the installer to do things like generate machinepool objects in the install-config [1]. It then invokes the installer binary for whatever OCP version is to be installed. We can only vendor one version of the installer. Recently we bumped our dependency [2] to bdbb3b1 to pick up a new field for azure. That PR merged because our CI was still testing against 4.9, which still uses lenient unmarshaling of the install-config [3]. Then we bumped our CI to use 4.10 [4] and started failing with ``` level=fatal msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to unmarshal install-config.yaml: failed to parse first occurence of unknown field: error unmarshaling JSON: while decoding JSON: json: unknown field "metadataService" ``` This is happening due to the combination of 4.10 having made unmarshaling strict [5][6] and the recent (pre-GA, but before bdbb3b1) addition of that field to the AWS MachinePool struct [7]: Hive constructs the object (without that field) [1] and marshals it into the install-config [8]. But since the new field in our vendored version of the installer doesn't have `omitempty`, the marshaler is adding it with its nil value. This install-config gets to the 4.10 installer, which *has* strict unmarshaling but *lacks* that field, and kaboom. We'd like some more time to think about it, but we think the proper solution may simply be to add `omitempty` to this field (and any other that's added to the same GVK of an object). [1] https://github.com/openshift/hive/blob/f9e6e206d5c6d5f50282ea09d4711edc8ad045c5/pkg/clusterresource/aws.go#L118-L129 [2] https://github.com/openshift/hive/pull/1764 [3] https://github.com/openshift/installer/blob/release-4.9/pkg/asset/installconfig/installconfig.go#L126 [4] https://github.com/openshift/release/pull/29603 [5] https://github.com/openshift/installer/pull/5307 [6] https://github.com/openshift/installer/blob/release-4.10/pkg/asset/installconfig/installconfig.go#L128 [7] https://github.com/openshift/installer/pull/5793/files#diff-ebf1b2fb6a5229b36ee6bffb2348a61c75b4c77a30d89f5b3e510015ffe908c7R31 [8] https://github.com/openshift/hive/blob/f9e6e206d5c6d5f50282ea09d4711edc8ad045c5/pkg/clusterresource/builder.go#L404 How to reproduce it (as minimally and precisely as possible)? See CI results: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_hive/1799/pull-ci-openshift-hive-master-e2e/1537921903648313344#1:build-log.txt%3A336
> add `omitempty` to this field I've submitted https://github.com/openshift/installer/pull/6028 to do that. (It's not linking to this bug, which probably means I spelled things wrong -- I don't work with BZ much.)
We're discussing this in slack [1]. In summary: - Adding `omitempty` is not sufficient to address the problem -- see [2]. The new fields must *also* be converted to pointers. I have updated https://github.com/openshift/installer/pull/6028 to do that for the EC2Metadata field. However... - There are other fields for which this problem manifests. We don't know all of them. But at least Azure OSImage [3] is affected. (Note that this one is already `omitempty`, but is still a problem because it's not a pointer.) I've added a commit for this field to the PR. - It is under consideration to simply revert the switch to UnmarshalStrict [4]. This would be the simplest solution, and guaranteed (?) to cover all cases, since we wouldn't need to root out individual fields. - Alternatively, it would be lovely to have unit tests that detect this class of problem so we can a) identify existing instances, and b) fail CI, preventing future violations. The challenge here is that such a test needs to operate across releases/commits, at least to address the former. For the latter, it may be possible to simply look for new fields and enforce that they're pointers with `omitempty`. Though that gets complicated for sub-structures, whose contents don't necessarily want/need to conform to that rule. [1] https://coreos.slack.com/archives/C68TNFWA2/p1655842108884639 [2] https://github.com/golang/go/issues/11939 [3] https://github.com/openshift/installer/blob/599c160a4e1c48ded0980955bc475ece0268083f/pkg/types/azure/machinepool.go#L45 [4] https://github.com/openshift/installer/pull/5307
The target release of this BZ is 4.11, but linked PR https://github.com/openshift/installer/pull/6039 is on master branch (4.12).
Updated target release to 4.12 and moved this back to ON_QA.
Tested against serval unknown fields in install-config.yaml, only one of them was printed in warning message. I expect all unknown fields printed in warning message. OCP Version 4.12.0-0.nightly-2022-07-27-133042 > tested with following config: apiVersion: v1 baseDomain: qe.devcluster.openshift.com compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: aws: type1: aaa rootVolume: aaaa: 123 replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: aws: type1: aaa rootVolume1: aaaa: 123 replicas: 3 metadata: creationTimestamp: null name: yunjiang-test1 networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 machineNetwork: - cidr: 10.0.0.0/16 networkType: OpenShiftSDN a: b: 123 serviceNetwork: - 172.30.0.0/16 platform: aws: region: us-east-2 publish: External > output: WARNING failed to parse first occurence of unknown field: failed to unmarshal install-config.yaml: error unmarshaling JSON: while decoding JSON: json: unknown field "aaaa" INFO Attempting to unmarshal while ignoring unknown keys because strict unmarshaling failed INFO Credentials loaded from the "de" profile in file "/home/cloud-user/.aws/credentials" INFO Consuming Install Config from target directory INFO Manifests created in: bb/manifests and bb/openshift
This is a known disadvantage to the golang library where only the first unknown field is printed. There seems to be no solution for this at the moment and we agreed to do it this way knowing the pitfall.
Moving this back to ON_QA due to https://bugzilla.redhat.com/show_bug.cgi?id=2098299#c10 Yunfei, thanks for checking this so thoroughly. Based on Aditya's comment, I think this is a limitation we should accept and is ultimately insignificant. Let us know if we need to do anything else here.
The limitation is acceptable, moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399