Description of problem:

A particular arrangement of data in a File object in ignition will cause MCO to panic.

Version-Release number of selected component (if applicable):

OCP build: 4.6.0-0.nightly-2020-08-31-220837

$ oc adm release info --commits quay.io/ocpmetal/ocp-release@sha256:ffba977ed01cd239d2753b54ac37b563644f9212120adce508781a60e4668c3e | grep machine-config-operator
  machine-config-operator   https://github.com/openshift/machine-config-operator   36f37f2d6009affe8174854f5ef5538e0cc49034

How reproducible:

always

Steps to Reproduce:
1. use openshift-baremetal-install to generate ignition files for bare metal provisioning
2. boot a host with bootstrap.ign

Actual results:

time="2020-09-14T21:48:21Z" level=info msg="I0914 21:48:21.264243 1 update.go:1265] Writing file \"/etc/ignition-machine-config-encapsulated.json\"\n"
time="2020-09-14T21:48:21Z" level=info msg="I0914 21:48:21.265137 1 update.go:1265] Writing file \"/etc/motd\"\n"
time="2020-09-14T21:48:21Z" level=info msg="panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x16833ea]\n\ngo>
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=info msg=", 0xc00025c401, 0x0, 0x0)\n\t/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:528 +0x2cd\nmain.runStartCmd(0x2a43400, 0xc0006>
Sep 14 21:48:21 localhost systemd[1]: libpod-6ade6eb4e758e0b467c9caf74c2c7e38831aaca2aad6b22449f8938613261936.scope: Consumed 281ms CPU time
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=error msg="Failed to extract ignition to disk"
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=error msg="Bootstrap failed failed executing nsenter [-t 1 -m -i -- podman run --net host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree>
Sep 14 21:48:21 localhost installer[2420]: time="2020-09-14T21:48:21Z" level=info msg="Exiting updateConfiguringStatus go routine"

Additional info:

I filed an issue on github that includes the ignition data that induces this bug, plus a root cause for the bug including a specific line of code: https://github.com/openshift/machine-config-operator/issues/2086
Tentatively setting target to 4.7 as this is not a release blocker at the moment.
Can we have a full must-gather?
Just from reading the code, this appears to be a regression since OCP 4.5. In that release, which used ignition spec 2.2, the "Source" field was not a pointer, so there was no opportunity for a nil pointer dereference. https://github.com/coreos/ignition/blob/v0.35.0/config/v2_2/types/schema.go#L54
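To make the type change concrete, here is a minimal, self-contained sketch; the structs are simplified stand-ins for the ignition config types (the names FileContentsV2/FileContentsV3 are mine for illustration), not the real definitions:

```go
package main

import "fmt"

// Spec 2.2 shape: Source is a plain string, so reading it is always
// safe (at worst it is "").
type FileContentsV2 struct {
	Source string
}

// Spec 3.x shape: Source became *string, and is nil when the ignition
// file omits contents.source (e.g. when only "append" is used).
type FileContentsV3 struct {
	Source *string
}

func main() {
	var v2 FileContentsV2
	fmt.Printf("v2 source: %q\n", v2.Source) // fine: empty string

	var v3 FileContentsV3
	fmt.Println(v3.Source == nil) // true
	// fmt.Println(*v3.Source)    // would panic: nil pointer dereference,
	//                            // the SIGSEGV pattern from the bug report
}
```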
> Can we have a full must gather ?

We are seeing the bug during cluster installation, and it prevents installation from completing. So I can't run must-gather without the cluster.
One other note: there is an implied secondary bug underneath the segfault. You'll discover this as soon as you write `if file.Contents.Source == nil {` and then think "ok, what next?", but I thought it may help to point it out here. The code assumes that an ignition File is constructed to overwrite any prior contents, using the File.Contents.Source field. But that is not a required field (hence the current attempt to read it causing the segfault); a File can instead be configured to append content, in which case the content itself shows up in the File.Append array. You can see in the github issue that the File in question uses the Append array, and MCO should presumably be able to carry that out. So the fix to this issue is likely two parts: A) don't try to dereference a nil pointer, and B) as the alternative, implement the ability to append content.
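For reference, a file entry of the kind described above sets only `append` and leaves `contents.source` unset entirely. This snippet is illustrative, modeled on the shape of the data in the github issue rather than copied from it:

```json
{
  "ignition": { "version": "3.1.0" },
  "storage": {
    "files": [
      {
        "path": "/etc/motd",
        "mode": 420,
        "append": [
          { "source": "data:,hello%20world%0A" }
        ]
      }
    ]
  }
}
```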
Ok, we had a discussion on this; moving back to 4.6 with higher priority to fix the panic.

In summary, we realized that `append` has always been broken in the MCO. We implicitly support it during install (since it's valid ignition) but not during updates, so any time we updated that cluster, the append sections would get dropped. This is because during updates we re-write all sections of the file, but ONLY source contents, never append. We only check for append and error in the `reconcilable` function, which only applies to diffs on upgrades, not to anything in the machineconfigs that has existed since install time. In 4.5 this wouldn't even error, because the contents would just be seen as empty. In 4.6, because it's now a pointer that we dereference, it panics if you leave it empty.

So our proposal is this:

4.6:
1. Fix the panic by adding a nil check.
2. If your data.source is empty (maybe also check that data.append is not empty), add an explicit warning telling you not to use append.
3. Change the baremetal-install binary to not use append, since it doesn't work anyway.

4.7: re-evaluate how we should support append. Either:
1. don't allow append (merge in spec 3) at all, not even during install time, explicitly, or
2. support append fully, both during install and upgrades.
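A rough sketch of what items 1 and 2 of the 4.6 plan could look like; the types, the checkFile function, and the error strings here are simplified stand-ins of mine, not the actual MCO code:

```go
package main

import (
	"fmt"
	"log"
)

// Simplified stand-ins for the ignition spec 3.x file types.
type Resource struct{ Source *string }

type File struct {
	Path     string
	Contents Resource
	Append   []Resource
}

// checkFile nil-checks Contents.Source instead of dereferencing it
// unconditionally, and surfaces an explicit error when only "append"
// sections are present.
func checkFile(f File) error {
	if len(f.Append) > 0 {
		return fmt.Errorf("ignition file %s includes append: unsupported", f.Path)
	}
	if f.Contents.Source == nil {
		return fmt.Errorf("ignition file %s has no contents.source", f.Path)
	}
	return nil
}

func main() {
	appendOnly := File{Path: "/etc/motd", Append: []Resource{{}}}
	if err := checkFile(appendOnly); err != nil {
		log.Println(err) // an explicit error instead of a SIGSEGV
	}
}
```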
So I applied a fix that explicitly rejects writing append sections, as well as adding a nil check to remove the panic. This won't fix the problem of an append being used, since it will still fail.

Looking at the installer a bit: https://github.com/openshift/installer/blame/7f8e8eab0b1a6fff4bf284957b7c390267c1a9d0/pkg/asset/ignition/bootstrap/bootstrap.go#L313

I'm not actually sure where the append is coming from. That's the only reference to append, and it's on bootstrap, which shouldn't involve the MCD, so we shouldn't have seen that error. Where are you actually getting the append section from?
That is the correct occurrence of append. The assisted-installer uses the MCD on the bootstrap node here: https://github.com/openshift/assisted-installer/blob/726f8a0/src/installer/installer.go#L183-L196
Ah, I see, so that's a conflict of operations. Hm...

The bootstrap node never interacts with the MCD normally, so we technically allowed "append" being used on bootstrap, as it was a one-time provisioning from the ignition binary. Mayhaps we should not allow it at all, but that's another discussion, and it currently shouldn't break anything.

The MCD never had any explicit support for append. If you try to apply a pre-install config that has append, the contents will be written by ignition on firstboot, but the MCD will immediately complain when it tries to check the file on validation, and the node will degrade. This behaviour exists in all versions (4.5 and before). I guess since onceFrom never validates the node nor uses the MCD later to write anything, it would have been "fine" from a 4.5 perspective to use the assisted installer like this.

Then it seems our options are:
1. explicitly disallow the use of append anywhere, which would be my PR plus a change to either the installer or the assisted installer, or
2. explicitly error during checks for append, but only warn when you try to append during writes. This allows the onceFrom path to go through. It would be a modification of my PR, but is technically "wrong" from a normal-operation perspective. This is maybe the best stop-gap for 4.6.

Side note: funnily enough, if you try to provide an ign spec 2.2 machineconfig to append, it will actually fail the MCC on bootstrap. This feels like an error somewhere in translation and is probably a bug. I think the assisted installer provides a spec 3 config, so it shouldn't be a problem?
Also, if we do 2, we'd have to actually add the ability to write files with "append", since we don't have that today. That would either be:
1. very hacky, since append is not idempotent, or
2. slightly complex, to ensure some sort of idempotency, or
3. we don't write appends at all for now, and assisted-installer bare metal installs just wouldn't have that snippet in the bootstrap MOTD.

I somewhat dislike 2, and I personally think it should only be a last resort. 1 is better in practice, although I'm not sure what the cleanest way to fix that would be: since the point of the original MOTD write was for it to get appended on the bootstrap node, I could be missing something, but it seems like it would be harder to implement. The sketch below shows the idempotency problem concretely.
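A minimal sketch of the non-idempotency mentioned above, using hypothetical helper names (appendWrite, overwrite), not MCO code: replaying an append grows the file on every sync, while an overwrite from contents.source converges to the same state no matter how often it runs:

```go
package main

import "fmt"

// appendWrite naively appends a snippet, the way a replayed ignition
// "append" section would: every run grows the file.
func appendWrite(current, snippet string) string {
	return current + snippet
}

// overwrite models contents.source: re-running it converges to the
// same state, so it can safely be reapplied on every MCD sync.
func overwrite(current, contents string) string {
	return contents
}

func main() {
	motd := ""
	for i := 0; i < 3; i++ {
		motd = appendWrite(motd, "hello world\n")
	}
	fmt.Printf("after 3 appends:\n%s", motd) // "hello world" three times

	for i := 0; i < 3; i++ {
		motd = overwrite(motd, "hello world\n")
	}
	fmt.Printf("after 3 overwrites:\n%s", motd) // exactly one "hello world"
}
```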
Would it be possible for the assisted installer not to do this MCD write for the bootstrap node? This wasn't ever really a supported operation.
Leaning towards 1 from comment 13. Clearly not supporting something is more supportable than supporting something that's "wrong", especially given that supporting the "wrong" behaviour via 2 requires more code in the MCO for append, which we don't support. 1 would also just require your PR and a change to the assisted installer.

Also, the point (if I'm reading correctly) is that the assisted installer is operating in an unexpected way, i.e. interacting with the MCD on bootstrap, when it should probably be striving to stay consistent with how the regular installer does things, both for maintainability (we expect that you do things one way, but you are doing something completely different that we can't keep track of) and for supportability. No?
Ok, after some more discussion, let's merge the fix to get rid of the panic and make failures more explicit. See: https://github.com/openshift/machine-config-operator/pull/2123#issuecomment-701564641

To summarize: the underlying issue is not in the MCD operation. Append is not supported at all in the ignition files section. There is a workaround in the assisted installer now to remove that append section. In 4.7 we can fix the installer or the assisted installer further, preferably to not use append at all.
Lowering priority given the above. This is not a testblocker or upgradeblocker but is nice to fix in 4.6 without much risk.
SGTM
Verification Part 1: Bad append post-install

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-14-080124   True        False         24s     Cluster version is 4.7.0-0.nightly-2020-12-14-080124

$ cat 99_openshift-machineconfig_99-worker-motd.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: bad-append-3
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    storage:
      files:
      - mode: 420
        path: /etc/motd
        append:
        - source: data:,hello%20world%0A

$ oc create -f ../99_openshift-machineconfig_99-worker-motd.yaml
machineconfig.machineconfiguration.openshift.io/bad-append-3 created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
00-worker                                          d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-master-container-runtime                        d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-master-kubelet                                  d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-worker-container-runtime                        d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
01-worker-kubelet                                  d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
99-master-generated-registries                     d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
99-master-ssh                                                                                 3.1.0             36m
99-worker-generated-registries                     d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
99-worker-ssh                                                                                 3.1.0             36m
bad-append-3                                                                                  3.1.0             9s
rendered-master-53ea1f220d2dc663af48dc3c6ec9680d   d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m
rendered-worker-19950adde7f662bc39ed8ec6c140b89e   d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             4s
rendered-worker-8f9c62dc20df06750a98214636b1f89f   d6b5d1922d848885cf5d2737306ab14323b7783a   3.2.0             26m

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-8f9c62dc20df06750a98214636b1f89f   False     True       True       3              0                   0                      1                      29m

$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-12-14T20:42:57Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2020-12-14T20:42:57Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-12-14T21:10:09Z"
  name: worker
  resourceVersion: "31421"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 898bb337-5f14-4b50-8951-5fec7a7a7390
spec:
  configuration:
    name: rendered-worker-19950adde7f662bc39ed8ec6c140b89e
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: bad-append-3
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-12-14T20:43:48Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-12-14T21:10:04Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-12-14T21:10:04Z"
    message: All nodes are updating to rendered-worker-19950adde7f662bc39ed8ec6c140b89e
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-12-14T21:10:09Z"
    message: 'Node ip-10-0-177-88.us-west-2.compute.internal is reporting: "can''t
      reconcile config rendered-worker-8f9c62dc20df06750a98214636b1f89f with rendered-worker-19950adde7f662bc39ed8ec6c140b89e:
      ignition file /etc/motd includes append: unreconcilable"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-12-14T21:10:09Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-8f9c62dc20df06750a98214636b1f89f
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 3
  readyMachineCount: 0
  unavailableMachineCount: 0
  updatedMachineCount: 0
Pre-install verification

$ openshift-install create manifests
$ cat << EOF > 99_openshift-machineconfig_99-worker-motd.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: bad-append-3
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.1.0
    storage:
      files:
      - mode: 420
        path: /etc/motd
        append:
        - source: data:,hello%20world%0A
EOF
$ cp 99_openshift-machineconfig_99-worker-motd.yaml openshift/
$ openshift-install create cluster

The cluster fails with the append error at the very last step. The worker machine config pool is degraded.
No doc needed: this simply made error reporting clearer; it does not change behaviour.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633